|
High quality revenue forecasting models use time series regression analysis to estimate the relationship between "explanatory" variables and a "dependent" variable. The explanatory variables can be nearly any economic indicator or other factor that could impact the dependent variable, but should be selected from the economic indicators provided by the economic forecast for the region. The dependent variable is the item for which a forecast is being prepared, in this case, the tax base of the dedicated transit tax or user fee. Specialized software such as Eviews, LIMDEP, SAS, and SPSS among others is used to prepare a regression analysis and forecasts, though even a spreadsheet such as Microsoft Excel has some limited regression functions.
A regression analysis begins by collecting relevant sets of data on the dependent variable and a series of potential independent variables. Tax revenue forecasts involve the construction of a data series on the tax base and a set of explanatory variables that will be tested for predictive power.
Model Design and Specification
Model design begins with the definition of the dependent variable. If the financial analyst is interested in forecasting revenues from an existing local sales tax, the analyst must construct the retail sales variable by dividing tax revenues by the tax rate for each year to construct the tax base. The tax base is the dependent variable because the tax rate is a policy variable that tends to change periodically. If the financial plan contains a referendum to increase the tax rate, the easiest way to forecast the revenue stream is to multiply the new tax rate by the forecasted tax base.
The explanatory variables are chosen based on knowledge of simple economic relationships and experience. Explanatory variables are generally chosen from the set of variable provided by the economic forecast to ensure that long-range forecasts of the explanatory variables are available. If the financial analyst is forecasting retail sales, the set of explanatory variables must include those things that influence demand for taxable items. These variables will generally include population and income as the primary drivers of retail sales, though other factors such as employment, wages, and interest rates, among others could be tested for their explanatory power.
The regression equation for retail sales could be expressed as:
Where a is the regression constant, b ’s are parameters to be estimated, and e is the error term.
If the financial analyst is interested in car registration fee revenues, the explanatory variables may include average car prices, population, auto ownership rates, and income.
Various other variables are tested and the regression statistics evaluated to identify the functional form that "fits the data" better than any other. The test statistic that measures goodness of fit is R2, also called the coefficient of determination. This test statistic expresses the percentage of the variation in the dependent variable that is explained by the explanatory variables. The closer R2 is to 1.0, the better the explanatory variables are at "explaining" the past variation in the dependent variable.
The development of good forecasting equations is a process of trial and error and requires experience to identify the preferred regression equation. While high a R2 is a plus when evaluating a regression analysis, it does not in itself indicate that the best model specification has been found. Regression models must also be inspected to ensure that all the variables included in the model are statistically significant and have the expected sign and reasonable magnitude. A regression equation with a high R2 that exhibits unexplainable statistical relationships among the variables, is flawed and can produce biased results. The section at the end of this addendum details some of the basic principles of developing regression-based forecasts and highlights best practices in these areas.
Preparing the Forecast
After developing and testing a good regression model that produces accurate forecasts of the dependent variable, the actual tax revenue forecast may be constructed. The number of steps required to accomplish this depends on the construction of the model and how the variables were transformed, but will generally involve the following steps:
- Make sure forecasts of the explanatory variables are entered into the statistical software program. Most statistical software will include a forecasting routine that allows the user to enter this data directly. The regional economic forecast should provide this information. In rare cases where the statistical software lacks this capability, the analyst may need to use a spreadsheet to construct the forecasting model.
- Prepare the dependent variable forecast for the analysis period. The result is a forecast of the tax base in constant dollars, likely expressed as a logarithm.
- Exponentiate the series to convert the dependent variable from a natural log to its original state.
- Apply the inflation forecasts to convert the constant (real) dollar tax base forecast to nominal (current) dollars.
- Multiply the inflated tax base forecast by the expected tax rate to generate the tax revenue forecast.
Developing a set of forecasting equations in the manner described here allows the easy update for future years. As new data for the current period becomes available, the data can be updated, the equations re-estimated, and new forecasts prepared using the most current data. These revenue forecasts are entered into the financial plan as revenue source line items by year.
Example Regression Application
The following example retail sales examples were developed with national data from the Bureau of Labor Statistics, the Bureau of Economic Analysis and the Census Bureau. The data was scaled by 1/50th to reflect an average US State.
Regression techniques can be used to estimate a simple trend line as well as estimate statistical relationships between key variables. The trend line estimation is simple and is a useful place to start when developing a forecasting model.
Before beginning any forecasting exercise, the data should be transformed in several ways to maximize the usefulness of the data in a regression equation. The initial data transformations are:
- from nominal to real dollars (see Principle 1 in next section);
- from total income to income per capita (see Principle 2 in next section); and
- logarithmic transformations of all likely dependent and independent variables (see Principle 3 in next section).
The data for the example regression application is given in Table 8-25.
Table 8-25 Data for Regression Analysis of Retail Sales
|
YEAR |
Retail Sales |
Population |
Employment |
Personal Income |
CPI |
Real Retail Sales |
Real Personal Income |
Real Per Capital Income |
|
1983 |
23,403 |
4,686,140 |
1,803,040 |
58,938 |
99.6 |
23,497.2 |
59,174.7 |
12,627.6 |
|
1984 |
25,738 |
4,726,960 |
1,888,160 |
65,496 |
103.9 |
24,772.2 |
63,037.5 |
13,335.7 |
|
1985 |
27,501 |
4,769,320 |
1,947,740 |
70,300 |
107.6 |
25,558.1 |
65,334.6 |
13,698.9 |
|
1986 |
28,993 |
4,813,020 |
1,986,880 |
74,248 |
109.6 |
26,453.2 |
67,744.5 |
14,075.3 |
|
1987 |
30,826 |
4,856,080 |
2,039,160 |
79,250 |
113.6 |
27,135.5 |
69,762.3 |
14,366.0 |
|
1988 |
33,124 |
4,900,420 |
2,104,180 |
85,442 |
118.3 |
28,000.0 |
72,224.9 |
14,738.5 |
|
1989 |
35,179 |
4,946,840 |
2,157,680 |
91,996 |
124.0 |
28,370.5 |
74,190.3 |
14,997.5 |
|
1990 |
36,892 |
4,962,860 |
2,188,060 |
98,064 |
130.7 |
28,226.6 |
75,029.8 |
15,118.3 |
|
1991 |
37,119 |
5,039,100 |
2,164,980 |
101,708 |
136.2 |
27,253.1 |
74,675.5 |
14,819.2 |
|
1992 |
39,032 |
5,111,710 |
2,172,020 |
107,808 |
140.3 |
27,820.2 |
76,841.1 |
15,032.3 |
|
1993 |
41,642 |
5,181,370 |
2,214,260 |
112,200 |
144.5 |
28,818.2 |
77,647.1 |
14,985.8 |
|
1994 |
44,964 |
5,246,360 |
2,283,260 |
117,760 |
148.2 |
30,340.1 |
79,460.2 |
15,145.8 |
|
1995 |
47,180 |
5,309,440 |
2,343,820 |
124,018 |
152.4 |
30,958.2 |
81,376.6 |
15,326.8 |
|
1996 |
50,047 |
5,371,640 |
2,392,160 |
130,948 |
156.9 |
31,897.6 |
83,459.5 |
15,537.1 |
|
1997 |
52,211 |
5,436,380 |
2,453,800 |
138,740 |
160.5 |
32,530.4 |
86,442.4 |
15,900.7 |
|
1998 |
54,912 |
5,500,800 |
2,517,300 |
148,520 |
163.0 |
33,688.3 |
91,116.6 |
16,564.2 |
|
1999 |
59,899 |
5,563,910 |
2,578,320 |
155,730 |
166.6 |
35,953.5 |
93,475.4 |
16,800.3 |
|
2000 |
64,641 |
5,628,440 |
2,634,400 |
168,132 |
172.2 |
37,538.3 |
97,637.6 |
17,347.2 |
|
2001 |
69,771 |
5,695,940 |
2,638,440 |
173,706 |
177.1 |
39,396.4 |
98,083.6 |
17,219.9 |
A regression fits a line that best represents all the data by minimizing the sum of squared residuals (the vertical distance between the linear trend line and the actual data) through method called least squares estimation. Estimating a trend line using regression is accomplished by simply including a constant term and a trend variable as regressors. The trend variable used in the example is the year. The regression equation used to estimate the trend in retail sales is:
Log(retail_sales) = a + b(year) + e
where a is the constant term, b is the coefficient on the trend variable, and e is the error term.
The regression output is given in Exhibit 8-1. The results suggest that trend alone explains almost 93 percent (R2 = .927) of the variation in retail sales. Both the constant term and the trend variable are highly significant at the 99 percent level and the Durbin-Watson (D-W) statistic indicates that the regression residuals are autocorrelated (see Principle 7 in the next section). Figure 8-12 confirms that the regression residuals display a noticeable pattern and could benefit from applying some autocorrelation correction techniques.
A nice feature of regressions that use logarithmic transformations is that the coefficient estimates can be interpreted as percent changes or "elasticities". In the trend regression, the coefficient estimate for year is 0.024 or 2.4 percent. The trend line for retail sales is estimated to increase 2.4 percent annually.
Exhibit 8-1 Regression Estimate for Trend in Real Retail Sales
+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = Log(RetSal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .2557718290E-01, Std.Dev.= .03879 | | Fit: R-squared= .927980, Adjusted R-squared = .92374 | | Model test: F[ 1, 17] = 219.04, Prob value = .00000 | | Diagnostic: Log-L = 35.8399, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.399, Akaike Info. Crt.= -3.562 | | Autocorrel: Durbin-Watson Statistic = .36060, Rho = .81970 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -37.60204780 3.2363497 -11.619 .0000 YEAR .02404533985 .16246674E-02 14.800 .0000 1992.0000
Predicted Values (* => observation was not in estimating sample.)
|
Observation |
Observed Y |
Predicted Y |
Residual |
95% Lower Bound |
95% Upper Bound |
|
1 |
10.065 |
10.08 |
-0.0152 |
9.9904 |
10.1693 |
|
2 |
10.117 |
10.104 |
0.0136 |
10.016 |
10.1922 |
|
3 |
10.149 |
10.128 |
0.0208 |
10.041 |
10.2153 |
|
4 |
10.183 |
10.152 |
0.0311 |
10.066 |
10.2384 |
|
5 |
10.209 |
10.176 |
0.0326 |
10.09 |
10.2617 |
|
6 |
10.24 |
10.2 |
0.0399 |
10.115 |
10.2852 |
|
7 |
10.253 |
10.224 |
0.029 |
10.14 |
10.3087 |
|
8 |
10.248 |
10.248 |
-0.0002 |
10.164 |
10.3324 |
|
9 |
10.213 |
10.272 |
-0.0593 |
10.188 |
10.3563 |
|
10 |
10.234 |
10.296 |
-0.0628 |
10.212 |
10.3802 |
|
11 |
10.269 |
10.32 |
-0.0516 |
10.236 |
10.4043 |
|
12 |
10.32 |
10.344 |
-0.0241 |
10.26 |
10.4286 |
|
13 |
10.34 |
10.368 |
-0.028 |
10.284 |
10.453 |
|
14 |
10.37 |
10.392 |
-0.0222 |
10.307 |
10.4775 |
|
15 |
10.39 |
10.416 |
-0.0266 |
10.331 |
10.5022 |
|
16 |
10.425 |
10.441 |
-0.0156 |
10.354 |
10.527 |
|
17 |
10.49 |
10.465 |
0.0254 |
10.377 |
10.5519 |
|
18 |
10.533 |
10.489 |
0.0445 |
10.4 |
10.577 |
|
19 |
10.581 |
10.513 |
0.0688 |
10.423 |
10.6021 |
Figure 8-12 Regression Residuals - Trend of Log (RetailSales)
Once an acceptable regression equation has been estimated, the revenue forecast can be prepared. The forecast is prepared by substituting the forecast year independent variables into the regression equation and adjusting the constant term by the final regression residual. The constant term is adjusted to ensure that the forecast is based on the last actual observed value for the dependent variable rather than the forecast value. This is accomplished by adding the final residual (0.0688) to the regression constant (-37.6). If the constant were not adjusted in this way, the first year of the forecast would be based on the predicted value for 2001 rather than the actual known value. The preparation of the revenue forecast is detailed in the following table.
Table 8-26: Trend Forecast of Retail Sales Tax Revenue
|
Year |
Log (Retail Sales) |
Real Retail Sales ($mil) |
CPI |
Retail Sales ($mil) |
% Subject to Tax |
Tax Rate (%) |
Tax Revenue ($mil) |
|
1983 |
10.0646 |
23,496 |
99.6 |
23,402 |
42.3 |
4.5 |
445.46 |
|
1984 |
10.1175 |
24,773 |
103.9 |
25,739 |
42.2 |
4.5 |
488.78 |
|
1985 |
10.1487 |
25,558 |
107.6 |
27,500 |
42.3 |
4.5 |
523.47 |
|
1986 |
10.1831 |
26,452 |
109.6 |
28,992 |
41.5 |
4.5 |
541.42 |
|
1987 |
10.2086 |
27,136 |
113.6 |
30,826 |
42.2 |
4.5 |
585.39 |
|
1988 |
10.24 |
28,001 |
118.3 |
33,125 |
42.5 |
4.5 |
633.52 |
|
1989 |
10.2531 |
28,370 |
124.0 |
35,179 |
41.1 |
4.5 |
650.64 |
|
1990 |
10.248 |
28,226 |
130.7 |
36,891 |
41.9 |
4.5 |
695.59 |
|
1991 |
10.2129 |
27,252 |
136.2 |
37,118 |
42.0 |
4.5 |
701.53 |
|
1992 |
10.2335 |
27,820 |
140.3 |
39,031 |
42.1 |
4.5 |
739.44 |
|
1993 |
10.2688 |
28,819 |
144.5 |
41,644 |
41.9 |
4.5 |
785.20 |
|
1994 |
10.3202 |
30,339 |
148.2 |
44,963 |
41.7 |
4.5 |
843.73 |
|
1995 |
10.3404 |
30,958 |
152.4 |
47,181 |
41.7 |
4.5 |
886.12 |
|
1996 |
10.3703 |
31,898 |
156.9 |
50,048 |
41.7 |
4.5 |
939.14 |
|
1997 |
10.3899 |
32,529 |
160.5 |
52,210 |
41.7 |
4.5 |
978.84 |
|
1998 |
10.4249 |
33,688 |
163.0 |
54,912 |
41.6 |
4.5 |
1,028.59 |
|
1999 |
10.49 |
35,954 |
166.6 |
59,900 |
41.6 |
4.5 |
1,121.04 |
|
2000 |
10.5331 |
37,538 |
172.2 |
64,640 |
41.6 |
4.5 |
1,208.69 |
|
2001 |
10.5814 |
39,395 |
177.1 |
69,769 |
41.5 |
4.5 |
1,303.44 |
|
2002 |
10.6055 |
40,357 |
181.9 |
73,424 |
41.5 |
4.5 |
1,371.19 |
|
2003 |
10.6296 |
41,339 |
186.3 |
77,020 |
41.5 |
4.5 |
1,438.35 |
|
2004 |
10.6536 |
42,345 |
190.7 |
80,748 |
41.5 |
5.5 |
1,843.06 |
|
2005 |
10.6777 |
43,376 |
195.1 |
84,611 |
41.5 |
5.5 |
1,931.25 |
|
2006 |
10.7017 |
44,431 |
199.4 |
88,615 |
41.5 |
5.5 |
2,022.64 |
|
2007 |
10.7257 |
45,513 |
203.8 |
92,764 |
41.5 |
5.5 |
2,117.33 |
|
2008 |
10.7498 |
46,620 |
208.2 |
97,061 |
41.5 |
5.5 |
2,215.43 |
|
2009 |
10.7738 |
47,755 |
212.6 |
101,514 |
41.5 |
5.5 |
2,317.05 |
|
2010 |
10.7979 |
48,917 |
216.9 |
106,125 |
41.5 |
5.5 |
2,422.31 |
|
2011 |
10.8219 |
50,108 |
221.3 |
110,901 |
41.5 |
5.5 |
2,531.31 |
|
2012 |
10.8460 |
51,327 |
225.7 |
115,846 |
41.5 |
5.5 |
2,644.19 |
|
2013 |
10.8700 |
52,576 |
230.1 |
120,967 |
41.5 |
5.5 |
2,761.07 |
|
2014 |
10.8941 |
53,856 |
234.5 |
126,268 |
41.5 |
5.5 |
2,882.06 |
|
2015 |
10.9181 |
55,167 |
238.8 |
131,755 |
41.5 |
5.5 |
3,007.31 |
|
2016 |
10.9422 |
56,509 |
243.2 |
137,435 |
41.5 |
5.5 |
3,136.95 |
|
2017 |
10.9662 |
57,884 |
247.6 |
143,313 |
41.5 |
5.5 |
3,271.12 |
|
2018 |
10.9902 |
59,293 |
252.0 |
149,396 |
41.5 |
5.5 |
3,409.96 |
|
2019 |
11.0143 |
60,736 |
256.3 |
155,690 |
41.5 |
5.5 |
3,553.62 |
|
2020 |
11.0383 |
62,214 |
260.7 |
162,202 |
41.5 |
5.5 |
3,702.26 |
Regression Parameters Intercept =(37.6020) Coefficient =0.0240 Last Residual =0.0688 Forecast Equations Retail sales = exp((Intercept + Last residual) + (coefficient * year))*(CPI / 100) Tax Revenue = (Retail sales) * (% subject to tax) * (Tax rate)
While regression models can be used to estimate simple trend lines like the previous example, the major strength of multiple regression models is the ability to quantify causal relationships. For retail sales, the most likely causal variables are population, employment, and income (see Principle 4 in the following section).
Once the primary causal variables of interest are identified, an initial regression model can be quickly specified and tested. The following example regression includes a constant term, log of population, log of income, and log of employment as explanatory variables for retail sales. This specification performs fairly well, explaining over 96 percent (R2 = .964) of the variation in retail sales.
Exhibit 8-2: Example Regression Model Output for Real Retail Sales
+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = Log(RetSal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 4, Deg.Fr.= 15 | | Residuals: Sum of squares= .1282863927E-01, Std.Dev.= .02924 | | Fit: R-squared= .963877, Adjusted R-squared = .95665 | | Model test: F[ 3, 15] = 133.42, Prob value = .00000 | | Diagnostic: Log-L = 42.3951, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.873, Akaike Info. Crt.= -4.042 | | Autocorrel: Durbin-Watson Statistic = .59419, Rho = .70291 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -1.649975403 6.9794061 -.236 .8163 Log(pop) .2956528846 .60211573 .491 .6305 15.451594 Log(inc) 1.092841303 .60837137 1.796 .0926 11.257811 Log(emp) -.3369919360 .78879657 -.427 .6753 14.614784
Unfortunately, the coefficient estimates suggest problems with this regression. The negative coefficient on the employment variable is clearly wrong, since more employment will result in more, not less, retail sales. In addition, the coefficient on the population variable is too low. Recall that a regression with logarithmic transformations of all the variables allows the coefficient estimates to be interpreted as elasticities. Therefore, the coefficient on population suggests that a 1 percent increase in population would cause a 0.29 percent increase in retail sales. A more sensible value would be much closer to 1. The cause of this problem is multicollinearity between the causal variables (see Principle 2 in the next section). The following table is the correlation matrix for all the potential causal variables in the example.
Table 8-27: Correlation Matrix for Causal Variables
|
|
POP |
EMPLOY |
INC |
INCPC |
|
POP |
1.00 |
0.98566 |
0.98427 |
0.95002 |
|
EMPLOY |
0.98566 |
1.00 |
0.99571 |
0.98394 |
|
INC |
0.98427 |
0.99571 |
1.00 |
0.9891 |
|
INCPC |
0.95002 |
0.98394 |
0.9891 |
1.00 |
All the variables are highly correlated with each other, with correlation coefficients that exceed 95 percent in all cases. It is unlikely that these variables can be combined in the same regression without causing problems with multicollinearity. This is a critical problem for the model since multicollinearity causes the coefficient estimates and t-statistics to be unstable. To fix this problem, the regression should be re-estimated with a single causal variable. The following three regression outputs display the results for the retail sales regression using population, employment and real income as single regressors along with a constant term.
Exhibit 8-3: Example Single Variable Regression Outputs
+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .2272403867E-01, Std.Dev.= .03656 | | Fit: R-squared= .936014, Adjusted R-squared = .93225 | | Model test: F[ 1, 17] = 248.68, Prob value = .00000 | | Diagnostic: Log-L = 36.9635, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.517, Akaike Info. Crt.= -3.680 | | Autocorrel: Durbin-Watson Statistic = .43612, Rho = .78194 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -23.28240372 2.1293400 -10.934 .0000 Log(pop) 2.173152690 .13780607 15.770 .0000 15.451594 +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .1618160697E-01, Std.Dev.= .03085 | | Fit: R-squared= .954436, Adjusted R-squared = .95176 | | Model test: F[ 1, 17] = 356.10, Prob value = .00000 | | Diagnostic: Log-L = 40.1892, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.857, Akaike Info. Crt.= -4.020 | | Autocorrel: Durbin-Watson Statistic = .38191, Rho = .80904 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -7.534325645 .94491320 -7.974 .0000 Log(emp) 1.220038179 .64652798E-01 18.871 .0000 14.614784 +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .1312234897E-01, Std.Dev.= .02778 | | Fit: R-squared= .963050, Adjusted R-squared = .96088 | | Model test: F[ 1, 17] = 443.08, Prob value = .00000 | | Diagnostic: Log-L = 42.1800, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -7.067, Akaike Info. Crt.= -4.229 | | Autocorrel: Durbin-Watson Statistic = .53525, Rho = .73238 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -.4734111606 .51167513 -.925 .3678 Log(inc) .9566407220 .45447151E-01 21.050 .0000 11.257811 Each regression performs quite well, but real income appears to perform slightly better than the other variables based on the R2 value. In addition, the Durbin-Watson statistic indicates less serial correlation of the error terms when income is the regressor. However, the answer here is not as clear as it may first appear. A solid argument can be made for using employment since the data is usually tracked more carefully and frequently than either population or income. State employment offices and the US Bureau of Labor Statistics track employment figures carefully and updated employment figures are generally available before population or income estimates. The quality and timeliness of updated data and forecasts is vital to the usefulness of the regression model, so employment could be the best choice in this example.
The Durbin-Watson statistic indicates serial correlation of the error terms, so some corrective action may be justified to derive better estimates (see Principle 6 in the next section). Most econometric software packages can correct for serial correlation automatically. Below is the output for the employment regression including the corrective first order autoregressive term AR(1). The AR(1) term is significant at the 99 percent level and the resulting Durbin-Watson statistic of 1.29 indicates that we cannot reject the null hypothesis of no first order autocorrelation at the 95 percent level. The inclusion of the autoregressive term has improved the model.
Exhibit 8-4: Example Autoregressive Model Output
+---------------------------------------------+ | AR(1) Model: e(t) = rho * e(t-1) + u(t) | | Initial value of rho = .80904 | | Iter= 6, SS= .006, Log-L= 48.822249 | | Final value of Rho = .89010 | | Durbin-Watson: e(t) = .21980 | | Std. Deviation: e(t) = .04124 | | Std. Deviation: u(t) = .01880 | | Durbin-Watson: u(t) = 1.29353 | | Autocorrelation: u(t) = .35324 | | N[0,1] used for significance levels | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -8.525992053 1.9607706 -4.348 .0000 Log(emp) 1.289517644 .13423099 9.607 .0000 14.614784 RHO .8900993672 .10742520 8.286 .0000
The resulting regression suggests that retail sales is quite sensitive to changes in employment. A 1.0 percent increase in employment is estimated to cause a 1.29 percent increase in retail sales. Generating a forecast from a model with an autoregressive term is similar to the trend example. The suggested forecasting equation can be written as:

The key issue in developing out of sample forecasts from autoregressive models is whether and how to include the autoregressive term. Since there is no actual data from which to calculate a forecasting error, there would seem to be no basis for including it in the forecasting equation. This guidance generally suggests ignoring the autoregressive term in out of sample forecasts. However, various techniques have been developed for using the autoregressive term in out of sample forecasts, but whether employing these terms is preferable to simply using the forecasting equation above, is unclear. This topic is complex and beyond the scope of this guidance, so the reader is referred to a good econometric textbook if more information is desired. Nevertheless, the use of the autoregressive term in the model provides superior estimates for the coefficients for the constant term and employment, so the use of these coefficients from the autoregressive model should be used to generate the out of sample forecast.
An example forecast is presented below. The results highlight the benefits of using a regression model with causal variables. Since retail sales is causally related to employment in the model, information about the growth in employment in the near term allows the forecast for retail sales to adjust accordingly. The simple trend analysis provides no quantitative basis for adjusting the forecast.
In the example in Table 8-28, employment growth in 2002 was assumed to be zero to reflect the slowing economy, then continue its past trend in future years. The regression based forecast for 2002 is about $33 million less under this scenario than the forecast that simply extrapolates past trends. The cumulative difference in the forecasts is about $1.2 billion over the forecast period. The much larger differences over the long term reflect the importance of incorporating updated information as quickly as possible. Forecasting errors in early years compound over time and become much larger as the length of the forecast increases.
Table 8-28 Sales Tax Revenue Forecast - Regression with Causal Variables
|
Year |
Log (Employ) |
Log (Retail Sales) |
Real Retail Sales ($mil) |
CPI |
Retail Sales ($mil) |
% Subject to Tax |
Tax Rate (%) |
Tax Revenue ($mil) |
|
1983 |
14.4050 |
10.0646 |
23,496 |
99.6 |
23,402 |
42.3 |
4.5 |
445.46 |
|
1984 |
14.4511 |
10.1175 |
24,773 |
103.9 |
25,739 |
42.2 |
4.5 |
488.78 |
|
1985 |
14.4822 |
10.1487 |
25,558 |
107.6 |
27,500 |
42.3 |
4.5 |
523.47 |
|
1986 |
14.5021 |
10.1831 |
26,452 |
109.6 |
28,992 |
41.5 |
4.5 |
541.42 |
|
1987 |
14.5280 |
10.2086 |
27,136 |
113.6 |
30,826 |
42.2 |
4.5 |
585.39 |
|
1988 |
14.5594 |
10.24 |
28,001 |
118.3 |
33,125 |
42.5 |
4.5 |
633.52 |
|
1989 |
14.5845 |
10.2531 |
28,370 |
124.0 |
35,179 |
41.1 |
4.5 |
650.64 |
|
1990 |
14.5985 |
10.248 |
28,226 |
130.7 |
36,891 |
41.9 |
4.5 |
695.59 |
|
1991 |
14.5879 |
10.2129 |
27,252 |
136.2 |
37,118 |
42.0 |
4.5 |
701.53 |
|
1992 |
14.5912 |
10.2335 |
27,820 |
140.3 |
39,031 |
42.1 |
4.5 |
739.44 |
|
1993 |
14.6104 |
10.2688 |
28,819 |
144.5 |
41,644 |
41.9 |
4.5 |
785.20 |
|
1994 |
14.6411 |
10.3202 |
30,339 |
148.2 |
44,963 |
41.7 |
4.5 |
843.73 |
|
1995 |
14.6673 |
10.3404 |
30,958 |
152.4 |
47,181 |
41.7 |
4.5 |
886.12 |
|
1996 |
14.6877 |
10.3703 |
31,898 |
156.9 |
50,048 |
41.7 |
4.5 |
939.14 |
|
1997 |
14.7131 |
10.3899 |
32,529 |
160.5 |
52,210 |
41.7 |
4.5 |
978.84 |
|
1998 |
14.7387 |
10.4249 |
33,688 |
163.0 |
54,912 |
41.6 |
4.5 |
1,028.59 |
|
1999 |
14.7626 |
10.49 |
35,954 |
166.6 |
59,900 |
41.6 |
4.5 |
1,121.04 |
|
2000 |
14.7842 |
10.5331 |
37,538 |
172.2 |
64,640 |
41.6 |
4.5 |
1,208.69 |
|
2001 |
14.7857 |
10.5814 |
39,395 |
177.1 |
69,769 |
41.5 |
4.5 |
1,303.44 |
|
2002 |
14.7857 |
10.5814 |
39,396 |
181.9 |
71,676 |
41.5 |
4.5 |
1,338.55 |
|
2003 |
14.7985 |
10.5979 |
40,052 |
186.3 |
74,622 |
41.5 |
4.5 |
1,393.56 |
|
2004 |
14.8125 |
10.6159 |
40,780 |
190.7 |
77,762 |
41.5 |
5.5 |
1,774.92 |
|
2005 |
14.8322 |
10.6414 |
41,833 |
195.1 |
81,601 |
41.5 |
5.5 |
1,862.54 |
|
2006 |
14.8520 |
10.6669 |
42,913 |
199.4 |
85,586 |
41.5 |
5.5 |
1,953.50 |
|
2007 |
14.8718 |
10.6924 |
44,021 |
203.8 |
89,723 |
41.5 |
5.5 |
2,047.92 |
|
2008 |
14.8915 |
10.7179 |
45,157 |
208.2 |
94,016 |
41.5 |
5.5 |
2,145.91 |
|
2009 |
14.9113 |
10.7434 |
46,323 |
212.6 |
98,470 |
41.5 |
5.5 |
2,247.59 |
|
2010 |
14.9311 |
10.7689 |
47,519 |
216.9 |
103,093 |
41.5 |
5.5 |
2,353.09 |
|
2011 |
14.9508 |
10.7944 |
48,746 |
221.3 |
107,888 |
41.5 |
5.5 |
2,462.55 |
|
2012 |
14.9706 |
10.8199 |
50,005 |
225.7 |
112,862 |
41.5 |
5.5 |
2,576.08 |
|
2013 |
14.9904 |
10.8454 |
51,296 |
230.1 |
118,022 |
41.5 |
5.5 |
2,693.84 |
|
2014 |
15.0102 |
10.8709 |
52,621 |
234.5 |
123,372 |
41.5 |
5.5 |
2,815.96 |
|
2015 |
15.0299 |
10.8964 |
53,979 |
238.8 |
128,920 |
41.5 |
5.5 |
2,942.60 |
|
2016 |
15.0497 |
10.9218 |
55,373 |
243.2 |
134,672 |
41.5 |
5.5 |
3,073.89 |
|
2017 |
15.0695 |
10.9473 |
56,803 |
247.6 |
140,635 |
41.5 |
5.5 |
3,210.00 |
|
2018 |
15.0892 |
10.9728 |
58,270 |
252.0 |
146,817 |
41.5 |
5.5 |
3,351.10 |
|
2019 |
15.1090 |
10.9983 |
59,774 |
256.3 |
153,224 |
41.5 |
5.5 |
3,497.33 |
|
2020 |
15.1288 |
11.0238 |
61,317 |
260.7 |
159,864 |
41.5 |
5.5 |
3,648.89 |
Regression Parameters
Intercept (8.52599) Employment Coefficient 1.28952 Last Residual 0.04100
Forecast Equation Retail sales = exp((Intercept + Last residual) + (Coefficient * Employment)) * CPI / 100 Tax Revenue = (Retail sales) * (% subject to tax) * (Tax rate)
Key Principles for Developing Regression Models
Principle 1: For regression analysis, use real rather than nominal variable constructions.
Generally speaking, all variables in a regression equation should be adjusted for inflation. The reason is that inflation is not a "real" factor. Rather it is purely a monetary scaling of all variables that are expressed in dollar amounts. If a monetary series exhibits a trend, the analyst does not know whether this is the result of actual growth or a purely nominal phenomenon that can be attributed to inflation, unless inflation is removed from the equation by expressing all monetary variables in constant dollars.
If inflation is not removed from the analysis, regression equations tend to have higher than justified R2. The reason being that the portion of the variation of the dependent variable that is attributable to inflation is known with certainty. That is, that portion of the variation in the explanatory variables that is attributable to inflation is perfectly correlated with the inflation in the dependent variable. The effect of that perfect correlation is to inflate the R2 statistic. A regression analysis for revenue forecasting should express all forecasts in constant dollars. At the end of the forecasting process, these constant dollar revenue forecasts are inflated based on the inflation rate assumptions from the economic forecast and included in the agency financial plan, which is expressed in inflated dollars.
Principle 2: Fewer variables is better (most of the time).
"…any time series regression containing more than four independent [explanatory] variables results in garbage."
Zvi Griliches, "Comments on Sims," in Frontiers of Quantitative Economics, vol.2, 1974, p. 335.
In most cases, a regression equation should have as few explanatory variables as possible. Simplicity in modeling is a virtue. Simple models have more clearly defined statistical relationships among variables and are easier to validate. Simple models are also easier to explain to non-modelers. Most importantly, models with too many variables run the risk of including variables that are correlated with each other. This problem is termed multi-collinearity. One of the assumptions of regression analysis is that all explanatory variables are independent of one another. If they are in fact, correlated, the coefficients on the explanatory variables become unstable and forecasts will be biased.
Before including a set of variables in an equation, the analyst should produce a correlation matrix to test which variables are correlated with each other and by how much. Generally speaking, most variables will display some amount of correlation, but if the correlation coefficient comes close to or exceeds 0.7, the two variables in question should not usually appear in the same equation. The result of a regression analysis when the explanatory variables are correlated can be unsettling. Coefficients can display the wrong sign, the t-statistics on the explanatory variables can be impossibly large, removing or adding a variable to the regression can make all the coefficients change dramatically. These impacts can make forecasts based on a multi-collinear regression unstable and theoretically unsound.
The most common method of dealing with this problem is to drop explanatory variables from the regression equation. This works well, as long as the regression equation still performs well and the model is fully specified. If the model is degraded, the other option is to transform some of the variables to remove the source of the collinearity. One common example of this approach is to transform an income variable to per capita income so that it can be used in an equation with population. By dividing income by population to get a per capita income value, one source of the collinearity with population is removed. This approach only works in specific cases where the source of the collinearity is clearly identifiable.
Principle 3: Use logarithmic transformations.
To ease the interpretation of results, economists often transform all their data series’ in a regression equation using natural logs. The usual revenue forecasting equation is a non-linear regression of the form:
where
y is the dependent variable
X is a vector of k independent variables
a is the regression constant
b are parameters to be estimated
e is the error term
When natural logs are applied to both sides of this equation, the result is:
which is linear in its components. This model is called the log-linear model since it is a non-linear regression that is linear in log form. In this specification, the coefficients are interpretable as elasticities:
which can be interpreted to mean the percent change in the dependent variable y given a percent change in the explanatory variable X. This is the definition of an elasticity.
A good test of a regression model forecasting retail sales from population and income would be to estimate the log-linear form of the regression model and ensure that the coefficients on population and income are in the neighborhood of 1. A coefficient of 1.0 for population means that a 1 percent increase in population leads to a 1 percent increase in retail sales, which is expected. A coefficient of 1.0 for per capita income means that a 1 percent increase in average income leads to a 1 percent increase in retail sales, which is also expected. The analyst will expect some deviation from 1.0 for population due to the reality that population growth may occur among groups that consume more or less than average. A similar explanation can be offered for average income where people’s propensity to consume taxable items from income growth may be more or less than the sample average.
A side note to logarithmic transformations is that they reduce the variance for each of the variables in the model. The effect of this is to raise the R2 statistic in the final regression model. This benefit is illusory because to get usable forecasts, the analyst must exponentiate the model results to generate forecasts in actual dollars. This reintroduces the wider variances and the final forecasts are not any better than they would be if they were prepared without the log transformation.
Other non-linear functional forms can be checked, but they will generally not be suitable for revenue forecasting where the relationships are often consistent and stable under the assumptions of the linear model. A good econometrics textbook should be consulted for suggestions about functional form. In addition, some econometric software packages will have a variety of non-linear regression specifications built into them.
Principle 4: The model should make theoretical sense.
Regression models that are used to forecast demand, such as the retail sales example, should have a sound theoretical basis behind the functional form. This means applying the basic principles of economics to the development of the functional form. For instance, demand is a function of the number of consumers, their income and the price of the product. This suggests that a model to forecasts new car sales (for a registration fee revenue forecast) would include population, income per capita, and the average price of a new car to conform to what economists believe about the structure of demand.
Once a model is defined that makes economic sense and performs well, the coefficients that are estimated should also make sense. The magnitude of the impact of a change in an explanatory variable should be reasonable and the coefficients must have the correct sign. If either of these problems arises, the most likely problem is multi-collinearity (see Principle 2). Other potential problems can be poor data or mistakes made when constructing the data series.
Many models should be estimated without a regression constant. The regression constant indicates the level of the dependent variable if the explanatory variables were zero. The dependent variable in many tax revenue forecasting models would be zero if, for instance, population and income were zero. The regression equation should be tested without the constant to attempt to find a specification that performs well without it. If a constant term is required to achieve adequate fit with the data, the likely problem is a non-linearity in the modeling relationships that occurs well outside the available data set. Non-linear regressions can be estimated, but in many cases, a constant term will need to be included despite theoretical misgivings. The modeler should ensure that if a constant term is included, it is statistically significant at the 95 percent level.
Principle 5: Test lagged variable specifications.
Sometimes, financial reactions to economic indicators are delayed. The most common instance of delayed reaction is related to investment, whether business investment or real estate, in response to changes in interest rates. The delayed reaction is most evident if quarterly data are available, but yearly data can also display lagged reactions. Lagged variable specifications should be tested for a variety of different periods by using the prior period value of the explanatory variables in the regression.
Often, the best explanatory variable in a time series regression is the lagged dependent variable. The rationale for this type of regression is that the best predictor of future conditions is often current conditions. Cyclical impacts can be modeled by using more than one lagged dependent variable. The following regression model was developed to forecast construction activity (const), as a tax base for development fees:
Where –1 indicates the prior period value and –2 indicates a value from two periods earlier. This model postulates that construction activity depends on the prime interest rate from the prior period, construction activity in the two previous periods, and a constant term. This model was found to explain 95 percent of past variation in construction activity in one metropolitan area.
Principle 6: Account for serial correlation.
One of the assumptions of multiple regression analysis is that the error terms of the regression equations are uncorrelated with each other through time. If this assumption is violated, the model is said to display serial correlation of the disturbances. In simple terms, the error terms display an observable pattern, and are therefore, not random.
The Durbin-Watson (DW) statistic, which is calculated automatically in the regression routine of all statistical software packages, is used to detect the existence of serial correlation. The rule of thumb is that when error terms are uncorrelated with each other, the DW statistic equals 2. The analyst should consult the table of critical values for the DW statistic for the relevant confidence interval given the specific regression model in question.
Time series forecasting models often violate this assumption. Modelers account for this problem by including an "autoregressive" form of the error term as follows:
This formulation states that the error term is a function of the error term from the previous period plus a random component. This is called a first-order autoregressive term. A class of models called ARIMA (autoregressive integrated moving average) models combine several auto regressive and moving average terms to generate models that rely on no explanatory variables except constructions of the dependent variable itself and the error terms of the regression. These models are a rather complex form of trend analysis.
Principle 7: Test for structural stability.
Structural change in the relationships estimated in a regression model can be the source of bias in the forecasting process. Two common tests of structural instability are the Chow breakpoint test and the Chow forecast test. These tests are performed by splitting the data at some year T and performing tests on the structural change between these two periods.
In the Chow breakpoint test, separate regressions are estimated using all data before year T and after year T. The hypothesis that the coefficients are equal in both time periods is tested using a F-test. There is no hard and fast rule for choosing T except to choose years where the analyst suspects some change in the economic relationships may have occurred, such as recessions or periods of economic instability. Often, the midpoint of the time series is chosen in case there is a consistent structural change that is occurring over the entire period. Some statistical software packages can perform the Chow breakpoint test automatically.
The Chow forecast test is similar to the Chow breakpoint test. The data before year T is used to prepare a forecast for the years after year T. An F-test is used to test the hypothesis that the forecast values are equal to the actual values. The test compares the prediction errors to the variance that is expected if the null hypothesis were true. Failure to reject the null hypothesis (the forecast errors are zero) suggests that the model shows no evidence of structural instability over the forecast period.
An example of structural instability in revenue forecasting can be seen in certain forecasts of retail sales used to estimate sales tax revenue. Sales taxes are generally levied on only a subset of goods and services. Most services, such as medical care or legal services, are untaxed. If the percentage of income spent on taxable items is changing over time, this effect can introduce structural instability into sales tax revenue forecasts. The best way to deal with structural instability is to include a variable that accounts for the effect. In this case, the percent of personal income spent on goods (as opposed to services), can be calculated from economic data and most detailed economic forecasts. Including the percent of personal income spent on goods as an explanatory variable generally removes the structural instability from sales tax forecasting models.
Another source of structural instability might be long-run changes in the age distribution of regional population, which can affect sales and income tax revenues. Changes in vehicle ownership rates can affect forecasts of licensing or registration fees. All revenue sources can be impacted by shifts in the regional economic base. Forecasts are improved when these trends are identified and incorporated into the financial forecasts.
|