You are here:Planning & Project Development Procedures and Technical Methods for Transit Project Planning Financial Planning for Transit Technical Addendum: Principles and Best Practices for Regression Analysis

Technical Addendum: Principles and Best Practices for Regression Analysis


Open printable version in a new window

High quality revenue forecasting models use time series regression analysis to estimate the relationship between "explanatory" variables and a "dependent" variable. The explanatory variables can be nearly any economic indicator or other factor that could impact the dependent variable, but should be selected from the economic indicators provided by the economic forecast for the region. The dependent variable is the item for which a forecast is being prepared, in this case, the tax base of the dedicated transit tax or user fee. Specialized software such as Eviews, LIMDEP, SAS, and SPSS among others is used to prepare a regression analysis and forecasts, though even a spreadsheet such as Microsoft Excel has some limited regression functions.

A regression analysis begins by collecting relevant sets of data on the dependent variable and a series of potential independent variables. Tax revenue forecasts involve the construction of a data series on the tax base and a set of explanatory variables that will be tested for predictive power.

Model Design and Specification

Model design begins with the definition of the dependent variable. If the financial analyst is interested in forecasting revenues from an existing local sales tax, the analyst must construct the retail sales variable by dividing tax revenues by the tax rate for each year to construct the tax base. The tax base is the dependent variable because the tax rate is a policy variable that tends to change periodically. If the financial plan contains a referendum to increase the tax rate, the easiest way to forecast the revenue stream is to multiply the new tax rate by the forecasted tax base.

The explanatory variables are chosen based on knowledge of simple economic relationships and experience. Explanatory variables are generally chosen from the set of variable provided by the economic forecast to ensure that long-range forecasts of the explanatory variables are available. If the financial analyst is forecasting retail sales, the set of explanatory variables must include those things that influence demand for taxable items. These variables will generally include population and income as the primary drivers of retail sales, though other factors such as employment, wages, and interest rates, among others could be tested for their explanatory power.

The regression equation for retail sales could be expressed as:

 Equation

Where a is the regression constant, b ’s are parameters to be estimated, and e is the error term.

If the financial analyst is interested in car registration fee revenues, the explanatory variables may include average car prices, population, auto ownership rates, and income.

Various other variables are tested and the regression statistics evaluated to identify the functional form that "fits the data" better than any other. The test statistic that measures goodness of fit is R2, also called the coefficient of determination. This test statistic expresses the percentage of the variation in the dependent variable that is explained by the explanatory variables. The closer R2 is to 1.0, the better the explanatory variables are at "explaining" the past variation in the dependent variable.

The development of good forecasting equations is a process of trial and error and requires experience to identify the preferred regression equation. While high a R2 is a plus when evaluating a regression analysis, it does not in itself indicate that the best model specification has been found. Regression models must also be inspected to ensure that all the variables included in the model are statistically significant and have the expected sign and reasonable magnitude. A regression equation with a high R2 that exhibits unexplainable statistical relationships among the variables, is flawed and can produce biased results. The section at the end of this addendum details some of the basic principles of developing regression-based forecasts and highlights best practices in these areas.

Preparing the Forecast

After developing and testing a good regression model that produces accurate forecasts of the dependent variable, the actual tax revenue forecast may be constructed. The number of steps required to accomplish this depends on the construction of the model and how the variables were transformed, but will generally involve the following steps:

    1. Make sure forecasts of the explanatory variables are entered into the statistical software program. Most statistical software will include a forecasting routine that allows the user to enter this data directly. The regional economic forecast should provide this information. In rare cases where the statistical software lacks this capability, the analyst may need to use a spreadsheet to construct the forecasting model.
    2. Prepare the dependent variable forecast for the analysis period. The result is a forecast of the tax base in constant dollars, likely expressed as a logarithm.
    3. Exponentiate the series to convert the dependent variable from a natural log to its original state.
    4. Apply the inflation forecasts to convert the constant (real) dollar tax base forecast to nominal (current) dollars.
    5. Multiply the inflated tax base forecast by the expected tax rate to generate the tax revenue forecast.

Developing a set of forecasting equations in the manner described here allows the easy update for future years. As new data for the current period becomes available, the data can be updated, the equations re-estimated, and new forecasts prepared using the most current data. These revenue forecasts are entered into the financial plan as revenue source line items by year.

Example Regression Application

The following example retail sales examples were developed with national data from the Bureau of Labor Statistics, the Bureau of Economic Analysis and the Census Bureau. The data was scaled by 1/50th to reflect an average US State.

Regression techniques can be used to estimate a simple trend line as well as estimate statistical relationships between key variables. The trend line estimation is simple and is a useful place to start when developing a forecasting model.

Before beginning any forecasting exercise, the data should be transformed in several ways to maximize the usefulness of the data in a regression equation. The initial data transformations are:

  • from nominal to real dollars (see Principle 1 in next section);
  • from total income to income per capita (see Principle 2 in next section); and
  • logarithmic transformations of all likely dependent and independent variables (see Principle 3 in next section).

The data for the example regression application is given in Table 8-25.

Table 8-25 Data for Regression Analysis of Retail Sales

YEAR

Retail Sales

Population

Employment

Personal Income

CPI

Real Retail Sales

Real Personal Income

Real Per Capital Income

1983

23,403

4,686,140

1,803,040

58,938

99.6

23,497.2

59,174.7

12,627.6

1984

25,738

4,726,960

1,888,160

65,496

103.9

24,772.2

63,037.5

13,335.7

1985

27,501

4,769,320

1,947,740

70,300

107.6

25,558.1

65,334.6

13,698.9

1986

28,993

4,813,020

1,986,880

74,248

109.6

26,453.2

67,744.5

14,075.3

1987

30,826

4,856,080

2,039,160

79,250

113.6

27,135.5

69,762.3

14,366.0

1988

33,124

4,900,420

2,104,180

85,442

118.3

28,000.0

72,224.9

14,738.5

1989

35,179

4,946,840

2,157,680

91,996

124.0

28,370.5

74,190.3

14,997.5

1990

36,892

4,962,860

2,188,060

98,064

130.7

28,226.6

75,029.8

15,118.3

1991

37,119

5,039,100

2,164,980

101,708

136.2

27,253.1

74,675.5

14,819.2

1992

39,032

5,111,710

2,172,020

107,808

140.3

27,820.2

76,841.1

15,032.3

1993

41,642

5,181,370

2,214,260

112,200

144.5

28,818.2

77,647.1

14,985.8

1994

44,964

5,246,360

2,283,260

117,760

148.2

30,340.1

79,460.2

15,145.8

1995

47,180

5,309,440

2,343,820

124,018

152.4

30,958.2

81,376.6

15,326.8

1996

50,047

5,371,640

2,392,160

130,948

156.9

31,897.6

83,459.5

15,537.1

1997

52,211

5,436,380

2,453,800

138,740

160.5

32,530.4

86,442.4

15,900.7

1998

54,912

5,500,800

2,517,300

148,520

163.0

33,688.3

91,116.6

16,564.2

1999

59,899

5,563,910

2,578,320

155,730

166.6

35,953.5

93,475.4

16,800.3

2000

64,641

5,628,440

2,634,400

168,132

172.2

37,538.3

97,637.6

17,347.2

2001

69,771

5,695,940

2,638,440

173,706

177.1

39,396.4

98,083.6

17,219.9

A regression fits a line that best represents all the data by minimizing the sum of squared residuals (the vertical distance between the linear trend line and the actual data) through method called least squares estimation. Estimating a trend line using regression is accomplished by simply including a constant term and a trend variable as regressors. The trend variable used in the example is the year. The regression equation used to estimate the trend in retail sales is:

Log(retail_sales) = a + b(year) + e

where a is the constant term, b is the coefficient on the trend variable, and e is the error term.

The regression output is given in Exhibit 8-1. The results suggest that trend alone explains almost 93 percent (R2 = .927) of the variation in retail sales. Both the constant term and the trend variable are highly significant at the 99 percent level and the Durbin-Watson (D-W) statistic indicates that the regression residuals are autocorrelated (see Principle 7 in the next section). Figure 8-12 confirms that the regression residuals display a noticeable pattern and could benefit from applying some autocorrelation correction techniques.

A nice feature of regressions that use logarithmic transformations is that the coefficient estimates can be interpreted as percent changes or "elasticities". In the trend regression, the coefficient estimate for year is 0.024 or 2.4 percent. The trend line for retail sales is estimated to increase 2.4 percent annually.

Exhibit 8-1 Regression Estimate for Trend in Real Retail Sales

+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = Log(RetSal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .2557718290E-01, Std.Dev.= .03879 | | Fit: R-squared= .927980, Adjusted R-squared = .92374 | | Model test: F[ 1, 17] = 219.04, Prob value = .00000 | | Diagnostic: Log-L = 35.8399, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.399, Akaike Info. Crt.= -3.562 | | Autocorrel: Durbin-Watson Statistic = .36060, Rho = .81970 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -37.60204780 3.2363497 -11.619 .0000 YEAR .02404533985 .16246674E-02 14.800 .0000 1992.0000



 

Predicted Values (* => observation was not in estimating sample.)

Observation

Observed Y

Predicted Y

Residual

95% Lower Bound

95% Upper Bound

1

10.065

10.08

-0.0152

9.9904

10.1693

2

10.117

10.104

0.0136

10.016

10.1922

3

10.149

10.128

0.0208

10.041

10.2153

4

10.183

10.152

0.0311

10.066

10.2384

5

10.209

10.176

0.0326

10.09

10.2617

6

10.24

10.2

0.0399

10.115

10.2852

7

10.253

10.224

0.029

10.14

10.3087

8

10.248

10.248

-0.0002

10.164

10.3324

9

10.213

10.272

-0.0593

10.188

10.3563

10

10.234

10.296

-0.0628

10.212

10.3802

11

10.269

10.32

-0.0516

10.236

10.4043

12

10.32

10.344

-0.0241

10.26

10.4286

13

10.34

10.368

-0.028

10.284

10.453

14

10.37

10.392

-0.0222

10.307

10.4775

15

10.39

10.416

-0.0266

10.331

10.5022

16

10.425

10.441

-0.0156

10.354

10.527

17

10.49

10.465

0.0254

10.377

10.5519

18

10.533

10.489

0.0445

10.4

10.577

19

10.581

10.513

0.0688

10.423

10.6021

Figure 8-12 Regression Residuals - Trend of Log (RetailSales)

Once an acceptable regression equation has been estimated, the revenue forecast can be prepared. The forecast is prepared by substituting the forecast year independent variables into the regression equation and adjusting the constant term by the final regression residual. The constant term is adjusted to ensure that the forecast is based on the last actual observed value for the dependent variable rather than the forecast value. This is accomplished by adding the final residual (0.0688) to the regression constant (-37.6). If the constant were not adjusted in this way, the first year of the forecast would be based on the predicted value for 2001 rather than the actual known value. The preparation of the revenue forecast is detailed in the following table.

Table 8-26: Trend Forecast of Retail Sales Tax Revenue

Year

Log (Retail Sales)

Real Retail Sales ($mil)

CPI

Retail Sales ($mil)

% Subject to Tax

Tax Rate (%)

Tax Revenue ($mil)

1983

10.0646

23,496

99.6

23,402

42.3

4.5

445.46

1984

10.1175

24,773

103.9

25,739

42.2

4.5

488.78

1985

10.1487

25,558

107.6

27,500

42.3

4.5

523.47

1986

10.1831

26,452

109.6

28,992

41.5

4.5

541.42

1987

10.2086

27,136

113.6

30,826

42.2

4.5

585.39

1988

10.24

28,001

118.3

33,125

42.5

4.5

633.52

1989

10.2531

28,370

124.0

35,179

41.1

4.5

650.64

1990

10.248

28,226

130.7

36,891

41.9

4.5

695.59

1991

10.2129

27,252

136.2

37,118

42.0

4.5

701.53

1992

10.2335

27,820

140.3

39,031

42.1

4.5

739.44

1993

10.2688

28,819

144.5

41,644

41.9

4.5

785.20

1994

10.3202

30,339

148.2

44,963

41.7

4.5

843.73

1995

10.3404

30,958

152.4

47,181

41.7

4.5

886.12

1996

10.3703

31,898

156.9

50,048

41.7

4.5

939.14

1997

10.3899

32,529

160.5

52,210

41.7

4.5

978.84

1998

10.4249

33,688

163.0

54,912

41.6

4.5

1,028.59

1999

10.49

35,954

166.6

59,900

41.6

4.5

1,121.04

2000

10.5331

37,538

172.2

64,640

41.6

4.5

1,208.69

2001

10.5814

39,395

177.1

69,769

41.5

4.5

1,303.44

2002

10.6055

40,357

181.9

73,424

41.5

4.5

1,371.19

2003

10.6296

41,339

186.3

77,020

41.5

4.5

1,438.35

2004

10.6536

42,345

190.7

80,748

41.5

5.5

1,843.06

2005

10.6777

43,376

195.1

84,611

41.5

5.5

1,931.25

2006

10.7017

44,431

199.4

88,615

41.5

5.5

2,022.64

2007

10.7257

45,513

203.8

92,764

41.5

5.5

2,117.33

2008

10.7498

46,620

208.2

97,061

41.5

5.5

2,215.43

2009

10.7738

47,755

212.6

101,514

41.5

5.5

2,317.05

2010

10.7979

48,917

216.9

106,125

41.5

5.5

2,422.31

2011

10.8219

50,108

221.3

110,901

41.5

5.5

2,531.31

2012

10.8460

51,327

225.7

115,846

41.5

5.5

2,644.19

2013

10.8700

52,576

230.1

120,967

41.5

5.5

2,761.07

2014

10.8941

53,856

234.5

126,268

41.5

5.5

2,882.06

2015

10.9181

55,167

238.8

131,755

41.5

5.5

3,007.31

2016

10.9422

56,509

243.2

137,435

41.5

5.5

3,136.95

2017

10.9662

57,884

247.6

143,313

41.5

5.5

3,271.12

2018

10.9902

59,293

252.0

149,396

41.5

5.5

3,409.96

2019

11.0143

60,736

256.3

155,690

41.5

5.5

3,553.62

2020

11.0383

62,214

260.7

162,202

41.5

5.5

3,702.26

Regression ParametersIntercept =(37.6020)
Coefficient =0.0240
Last Residual =0.0688
Forecast Equations
Retail sales = exp((Intercept + Last residual) + (coefficient * year))*(CPI / 100)
Tax Revenue = (Retail sales) * (% subject to tax) * (Tax rate)

While regression models can be used to estimate simple trend lines like the previous example, the major strength of multiple regression models is the ability to quantify causal relationships. For retail sales, the most likely causal variables are population, employment, and income (see Principle 4 in the following section).

Once the primary causal variables of interest are identified, an initial regression model can be quickly specified and tested. The following example regression includes a constant term, log of population, log of income, and log of employment as explanatory variables for retail sales. This specification performs fairly well, explaining over 96 percent (R2 = .964) of the variation in retail sales.

Exhibit 8-2: Example Regression Model Output for Real Retail Sales

+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = Log(RetSal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 4, Deg.Fr.= 15 | | Residuals: Sum of squares= .1282863927E-01, Std.Dev.= .02924 | | Fit: R-squared= .963877, Adjusted R-squared = .95665 | | Model test: F[ 3, 15] = 133.42, Prob value = .00000 | | Diagnostic: Log-L = 42.3951, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.873, Akaike Info. Crt.= -4.042 | | Autocorrel: Durbin-Watson Statistic = .59419, Rho = .70291 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -1.649975403 6.9794061 -.236 .8163 Log(pop) .2956528846 .60211573 .491 .6305 15.451594 Log(inc) 1.092841303 .60837137 1.796 .0926 11.257811 Log(emp) -.3369919360 .78879657 -.427 .6753 14.614784

Unfortunately, the coefficient estimates suggest problems with this regression. The negative coefficient on the employment variable is clearly wrong, since more employment will result in more, not less, retail sales. In addition, the coefficient on the population variable is too low. Recall that a regression with logarithmic transformations of all the variables allows the coefficient estimates to be interpreted as elasticities. Therefore, the coefficient on population suggests that a 1 percent increase in population would cause a 0.29 percent increase in retail sales. A more sensible value would be much closer to 1. The cause of this problem is multicollinearity between the causal variables (see Principle 2 in the next section). The following table is the correlation matrix for all the potential causal variables in the example.

Table 8-27: Correlation Matrix for Causal Variables

 

POP

EMPLOY

INC

INCPC

POP

1.00

0.98566

0.98427

0.95002

EMPLOY

0.98566

1.00

0.99571

0.98394

INC

0.98427

0.99571

1.00

0.9891

INCPC

0.95002

0.98394

0.9891

1.00

All the variables are highly correlated with each other, with correlation coefficients that exceed 95 percent in all cases. It is unlikely that these variables can be combined in the same regression without causing problems with multicollinearity. This is a critical problem for the model since multicollinearity causes the coefficient estimates and t-statistics to be unstable. To fix this problem, the regression should be re-estimated with a single causal variable. The following three regression outputs display the results for the retail sales regression using population, employment and real income as single regressors along with a constant term.

Exhibit 8-3: Example Single Variable Regression Outputs

+-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .2272403867E-01, Std.Dev.= .03656 | | Fit: R-squared= .936014, Adjusted R-squared = .93225 | | Model test: F[ 1, 17] = 248.68, Prob value = .00000 | | Diagnostic: Log-L = 36.9635, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.517, Akaike Info. Crt.= -3.680 | | Autocorrel: Durbin-Watson Statistic = .43612, Rho = .78194 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -23.28240372 2.1293400 -10.934 .0000 Log(pop) 2.173152690 .13780607 15.770 .0000 15.451594 +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .1618160697E-01, Std.Dev.= .03085 | | Fit: R-squared= .954436, Adjusted R-squared = .95176 | | Model test: F[ 1, 17] = 356.10, Prob value = .00000 | | Diagnostic: Log-L = 40.1892, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -6.857, Akaike Info. Crt.= -4.020 | | Autocorrel: Durbin-Watson Statistic = .38191, Rho = .80904 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -7.534325645 .94491320 -7.974 .0000 Log(emp) 1.220038179 .64652798E-01 18.871 .0000 14.614784 +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = log(Retsal) Mean= 10.29626918 , S.D.= .1404632916 | | Model size: Observations = 19, Parameters = 2, Deg.Fr.= 17 | | Residuals: Sum of squares= .1312234897E-01, Std.Dev.= .02778 | | Fit: R-squared= .963050, Adjusted R-squared = .96088 | | Model test: F[ 1, 17] = 443.08, Prob value = .00000 | | Diagnostic: Log-L = 42.1800, Restricted(b=0) Log-L = 10.8472 | | LogAmemiyaPrCrt.= -7.067, Akaike Info. Crt.= -4.229 | | Autocorrel: Durbin-Watson Statistic = .53525, Rho = .73238 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -.4734111606 .51167513 -.925 .3678 Log(inc) .9566407220 .45447151E-01 21.050 .0000 11.257811

Each regression performs quite well, but real income appears to perform slightly better than the other variables based on the R2 value. In addition, the Durbin-Watson statistic indicates less serial correlation of the error terms when income is the regressor. However, the answer here is not as clear as it may first appear. A solid argument can be made for using employment since the data is usually tracked more carefully and frequently than either population or income. State employment offices and the US Bureau of Labor Statistics track employment figures carefully and updated employment figures are generally available before population or income estimates. The quality and timeliness of updated data and forecasts is vital to the usefulness of the regression model, so employment could be the best choice in this example.

The Durbin-Watson statistic indicates serial correlation of the error terms, so some corrective action may be justified to derive better estimates (see Principle 6 in the next section). Most econometric software packages can correct for serial correlation automatically. Below is the output for the employment regression including the corrective first order autoregressive term AR(1). The AR(1) term is significant at the 99 percent level and the resulting Durbin-Watson statistic of 1.29 indicates that we cannot reject the null hypothesis of no first order autocorrelation at the 95 percent level. The inclusion of the autoregressive term has improved the model.

Exhibit 8-4: Example Autoregressive Model Output

+---------------------------------------------+ | AR(1) Model: e(t) = rho * e(t-1) + u(t) | | Initial value of rho = .80904 | | Iter= 6, SS= .006, Log-L= 48.822249 | | Final value of Rho = .89010 | | Durbin-Watson: e(t) = .21980 | | Std. Deviation: e(t) = .04124 | | Std. Deviation: u(t) = .01880 | | Durbin-Watson: u(t) = 1.29353 | | Autocorrelation: u(t) = .35324 | | N[0,1] used for significance levels | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant -8.525992053 1.9607706 -4.348 .0000 Log(emp) 1.289517644 .13423099 9.607 .0000 14.614784 RHO .8900993672 .10742520 8.286 .0000

The resulting regression suggests that retail sales is quite sensitive to changes in employment. A 1.0 percent increase in employment is estimated to cause a 1.29 percent increase in retail sales. Generating a forecast from a model with an autoregressive term is similar to the trend example. The suggested forecasting equation can be written as:

Equation

The key issue in developing out of sample forecasts from autoregressive models is whether and how to include the autoregressive term. Since there is no actual data from which to calculate a forecasting error, there would seem to be no basis for including it in the forecasting equation. This guidance generally suggests ignoring the autoregressive term in out of sample forecasts. However, various techniques have been developed for using the autoregressive term in out of sample forecasts, but whether employing these terms is preferable to simply using the forecasting equation above, is unclear. This topic is complex and beyond the scope of this guidance, so the reader is referred to a good econometric textbook if more information is desired. Nevertheless, the use of the autoregressive term in the model provides superior estimates for the coefficients for the constant term and employment, so the use of these coefficients from the autoregressive model should be used to generate the out of sample forecast.

An example forecast is presented below. The results highlight the benefits of using a regression model with causal variables. Since retail sales is causally related to employment in the model, information about the growth in employment in the near term allows the forecast for retail sales to adjust accordingly. The simple trend analysis provides no quantitative basis for adjusting the forecast.

In the example in Table 8-28, employment growth in 2002 was assumed to be zero to reflect the slowing economy, then continue its past trend in future years. The regression based forecast for 2002 is about $33 million less under this scenario than the forecast that simply extrapolates past trends. The cumulative difference in the forecasts is about $1.2 billion over the forecast period. The much larger differences over the long term reflect the importance of incorporating updated information as quickly as possible. Forecasting errors in early years compound over time and become much larger as the length of the forecast increases.

Table 8-28 Sales Tax Revenue Forecast - Regression with Causal Variables

Year

Log (Employ)

Log (Retail Sales)

Real Retail Sales ($mil)

CPI

Retail Sales ($mil)

% Subject to Tax

Tax Rate (%)

Tax Revenue ($mil)

1983

14.4050

10.0646

23,496

99.6

23,402

42.3

4.5

445.46

1984

14.4511

10.1175

24,773

103.9

25,739

42.2

4.5

488.78

1985

14.4822

10.1487

25,558

107.6

27,500

42.3

4.5

523.47

1986

14.5021

10.1831

26,452

109.6

28,992

41.5

4.5

541.42

1987

14.5280

10.2086

27,136

113.6

30,826

42.2

4.5

585.39

1988

14.5594

10.24

28,001

118.3

33,125

42.5

4.5

633.52

1989

14.5845

10.2531

28,370

124.0

35,179

41.1

4.5

650.64

1990

14.5985

10.248

28,226

130.7

36,891

41.9

4.5

695.59

1991

14.5879

10.2129

27,252

136.2

37,118

42.0

4.5

701.53

1992

14.5912

10.2335

27,820

140.3

39,031

42.1

4.5

739.44

1993

14.6104

10.2688

28,819

144.5

41,644

41.9

4.5

785.20

1994

14.6411

10.3202

30,339

148.2

44,963

41.7

4.5

843.73

1995

14.6673

10.3404

30,958

152.4

47,181

41.7

4.5

886.12

1996

14.6877

10.3703

31,898

156.9

50,048

41.7

4.5

939.14

1997

14.7131

10.3899

32,529

160.5

52,210

41.7

4.5

978.84

1998

14.7387

10.4249

33,688

163.0

54,912

41.6

4.5

1,028.59

1999

14.7626

10.49

35,954

166.6

59,900

41.6

4.5

1,121.04

2000

14.7842

10.5331

37,538

172.2

64,640

41.6

4.5

1,208.69

2001

14.7857

10.5814

39,395

177.1

69,769

41.5

4.5

1,303.44

2002

14.7857

10.5814

39,396

181.9

71,676

41.5

4.5

1,338.55

2003

14.7985

10.5979

40,052

186.3

74,622

41.5

4.5

1,393.56

2004

14.8125

10.6159

40,780

190.7

77,762

41.5

5.5

1,774.92

2005

14.8322

10.6414

41,833

195.1

81,601

41.5

5.5

1,862.54

2006

14.8520

10.6669

42,913

199.4

85,586

41.5

5.5

1,953.50

2007

14.8718

10.6924

44,021

203.8

89,723

41.5

5.5

2,047.92

2008

14.8915

10.7179

45,157

208.2

94,016

41.5

5.5

2,145.91

2009

14.9113

10.7434

46,323

212.6

98,470

41.5

5.5

2,247.59

2010

14.9311

10.7689

47,519

216.9

103,093

41.5

5.5

2,353.09

2011

14.9508

10.7944

48,746

221.3

107,888

41.5

5.5

2,462.55

2012

14.9706

10.8199

50,005

225.7

112,862

41.5

5.5

2,576.08

2013

14.9904

10.8454

51,296

230.1

118,022

41.5

5.5

2,693.84

2014

15.0102

10.8709

52,621

234.5

123,372

41.5

5.5

2,815.96

2015

15.0299

10.8964

53,979

238.8

128,920

41.5

5.5

2,942.60

2016

15.0497

10.9218

55,373

243.2

134,672

41.5

5.5

3,073.89

2017

15.0695

10.9473

56,803

247.6

140,635

41.5

5.5

3,210.00

2018

15.0892

10.9728

58,270

252.0

146,817

41.5

5.5

3,351.10

2019

15.1090

10.9983

59,774

256.3

153,224

41.5

5.5

3,497.33

2020

15.1288

11.0238

61,317

260.7

159,864

41.5

5.5

3,648.89

Regression Parameters
Intercept (8.52599)
Employment Coefficient 1.28952
Last Residual 0.04100

Forecast Equation
Retail sales = exp((Intercept + Last residual) + (Coefficient * Employment)) * CPI / 100
Tax Revenue = (Retail sales) * (% subject to tax) * (Tax rate)

Key Principles for Developing Regression Models

Principle 1: For regression analysis, use real rather than nominal variable constructions.

Generally speaking, all variables in a regression equation should be adjusted for inflation. The reason is that inflation is not a "real" factor. Rather it is purely a monetary scaling of all variables that are expressed in dollar amounts. If a monetary series exhibits a trend, the analyst does not know whether this is the result of actual growth or a purely nominal phenomenon that can be attributed to inflation, unless inflation is removed from the equation by expressing all monetary variables in constant dollars.

If inflation is not removed from the analysis, regression equations tend to have higher than justified R2. The reason being that the portion of the variation of the dependent variable that is attributable to inflation is known with certainty. That is, that portion of the variation in the explanatory variables that is attributable to inflation is perfectly correlated with the inflation in the dependent variable. The effect of that perfect correlation is to inflate the R2 statistic. A regression analysis for revenue forecasting should express all forecasts in constant dollars. At the end of the forecasting process, these constant dollar revenue forecasts are inflated based on the inflation rate assumptions from the economic forecast and included in the agency financial plan, which is expressed in inflated dollars.

Principle 2: Fewer variables is better (most of the time).

"…any time series regression containing more than four independent [explanatory] variables results in garbage."

Zvi Griliches, "Comments on Sims," in Frontiers of Quantitative Economics, vol.2, 1974, p. 335.

In most cases, a regression equation should have as few explanatory variables as possible. Simplicity in modeling is a virtue. Simple models have more clearly defined statistical relationships among variables and are easier to validate. Simple models are also easier to explain to non-modelers. Most importantly, models with too many variables run the risk of including variables that are correlated with each other. This problem is termed multi-collinearity. One of the assumptions of regression analysis is that all explanatory variables are independent of one another. If they are in fact, correlated, the coefficients on the explanatory variables become unstable and forecasts will be biased.

Before including a set of variables in an equation, the analyst should produce a correlation matrix to test which variables are correlated with each other and by how much. Generally speaking, most variables will display some amount of correlation, but if the correlation coefficient comes close to or exceeds 0.7, the two variables in question should not usually appear in the same equation. The result of a regression analysis when the explanatory variables are correlated can be unsettling. Coefficients can display the wrong sign, the t-statistics on the explanatory variables can be impossibly large, removing or adding a variable to the regression can make all the coefficients change dramatically. These impacts can make forecasts based on a multi-collinear regression unstable and theoretically unsound.

The most common method of dealing with this problem is to drop explanatory variables from the regression equation. This works well, as long as the regression equation still performs well and the model is fully specified. If the model is degraded, the other option is to transform some of the variables to remove the source of the collinearity. One common example of this approach is to transform an income variable to per capita income so that it can be used in an equation with population. By dividing income by population to get a per capita income value, one source of the collinearity with population is removed. This approach only works in specific cases where the source of the collinearity is clearly identifiable.

Principle 3: Use logarithmic transformations.

To ease the interpretation of results, economists often transform all their data series’ in a regression equation using natural logs. The usual revenue forecasting equation is a non-linear regression of the form:

 Equation

where

y is the dependent variable

X is a vector of k independent variables

a is the regression constant

b are parameters to be estimated

e is the error term

When natural logs are applied to both sides of this equation, the result is:

 Equation

which is linear in its components. This model is called the log-linear model since it is a non-linear regression that is linear in log form. In this specification, the coefficients are interpretable as elasticities:

 Equation

which can be interpreted to mean the percent change in the dependent variable y given a percent change in the explanatory variable X. This is the definition of an elasticity.

A good test of a regression model forecasting retail sales from population and income would be to estimate the log-linear form of the regression model and ensure that the coefficients on population and income are in the neighborhood of 1. A coefficient of 1.0 for population means that a 1 percent increase in population leads to a 1 percent increase in retail sales, which is expected. A coefficient of 1.0 for per capita income means that a 1 percent increase in average income leads to a 1 percent increase in retail sales, which is also expected. The analyst will expect some deviation from 1.0 for population due to the reality that population growth may occur among groups that consume more or less than average. A similar explanation can be offered for average income where people’s propensity to consume taxable items from income growth may be more or less than the sample average.

A side note to logarithmic transformations is that they reduce the variance for each of the variables in the model. The effect of this is to raise the R2 statistic in the final regression model. This benefit is illusory because to get usable forecasts, the analyst must exponentiate the model results to generate forecasts in actual dollars. This reintroduces the wider variances and the final forecasts are not any better than they would be if they were prepared without the log transformation.

Other non-linear functional forms can be checked, but they will generally not be suitable for revenue forecasting where the relationships are often consistent and stable under the assumptions of the linear model. A good econometrics textbook should be consulted for suggestions about functional form. In addition, some econometric software packages will have a variety of non-linear regression specifications built into them.

Principle 4: The model should make theoretical sense.

Regression models that are used to forecast demand, such as the retail sales example, should have a sound theoretical basis behind the functional form. This means applying the basic principles of economics to the development of the functional form. For instance, demand is a function of the number of consumers, their income and the price of the product. This suggests that a model to forecasts new car sales (for a registration fee revenue forecast) would include population, income per capita, and the average price of a new car to conform to what economists believe about the structure of demand.

Once a model is defined that makes economic sense and performs well, the coefficients that are estimated should also make sense. The magnitude of the impact of a change in an explanatory variable should be reasonable and the coefficients must have the correct sign. If either of these problems arises, the most likely problem is multi-collinearity (see Principle 2). Other potential problems can be poor data or mistakes made when constructing the data series.

Many models should be estimated without a regression constant. The regression constant indicates the level of the dependent variable if the explanatory variables were zero. The dependent variable in many tax revenue forecasting models would be zero if, for instance, population and income were zero. The regression equation should be tested without the constant to attempt to find a specification that performs well without it. If a constant term is required to achieve adequate fit with the data, the likely problem is a non-linearity in the modeling relationships that occurs well outside the available data set. Non-linear regressions can be estimated, but in many cases, a constant term will need to be included despite theoretical misgivings. The modeler should ensure that if a constant term is included, it is statistically significant at the 95 percent level.

Principle 5: Test lagged variable specifications.

Sometimes, financial reactions to economic indicators are delayed. The most common instance of delayed reaction is related to investment, whether business investment or real estate, in response to changes in interest rates. The delayed reaction is most evident if quarterly data are available, but yearly data can also display lagged reactions. Lagged variable specifications should be tested for a variety of different periods by using the prior period value of the explanatory variables in the regression.

Often, the best explanatory variable in a time series regression is the lagged dependent variable. The rationale for this type of regression is that the best predictor of future conditions is often current conditions. Cyclical impacts can be modeled by using more than one lagged dependent variable. The following regression model was developed to forecast construction activity (const), as a tax base for development fees:

 Equation

Where –1 indicates the prior period value and –2 indicates a value from two periods earlier. This model postulates that construction activity depends on the prime interest rate from the prior period, construction activity in the two previous periods, and a constant term. This model was found to explain 95 percent of past variation in construction activity in one metropolitan area.

Principle 6: Account for serial correlation.

One of the assumptions of multiple regression analysis is that the error terms of the regression equations are uncorrelated with each other through time. If this assumption is violated, the model is said to display serial correlation of the disturbances. In simple terms, the error terms display an observable pattern, and are therefore, not random.

The Durbin-Watson (DW) statistic, which is calculated automatically in the regression routine of all statistical software packages, is used to detect the existence of serial correlation. The rule of thumb is that when error terms are uncorrelated with each other, the DW statistic equals 2. The analyst should consult the table of critical values for the DW statistic for the relevant confidence interval given the specific regression model in question.

Time series forecasting models often violate this assumption. Modelers account for this problem by including an "autoregressive" form of the error term as follows:

 Equation

This formulation states that the error term is a function of the error term from the previous period plus a random component. This is called a first-order autoregressive term. A class of models called ARIMA (autoregressive integrated moving average) models combine several auto regressive and moving average terms to generate models that rely on no explanatory variables except constructions of the dependent variable itself and the error terms of the regression. These models are a rather complex form of trend analysis.

Principle 7: Test for structural stability.

Structural change in the relationships estimated in a regression model can be the source of bias in the forecasting process. Two common tests of structural instability are the Chow breakpoint test and the Chow forecast test. These tests are performed by splitting the data at some year T and performing tests on the structural change between these two periods.

In the Chow breakpoint test, separate regressions are estimated using all data before year T and after year T. The hypothesis that the coefficients are equal in both time periods is tested using a F-test. There is no hard and fast rule for choosing T except to choose years where the analyst suspects some change in the economic relationships may have occurred, such as recessions or periods of economic instability. Often, the midpoint of the time series is chosen in case there is a consistent structural change that is occurring over the entire period. Some statistical software packages can perform the Chow breakpoint test automatically.

The Chow forecast test is similar to the Chow breakpoint test. The data before year T is used to prepare a forecast for the years after year T. An F-test is used to test the hypothesis that the forecast values are equal to the actual values. The test compares the prediction errors to the variance that is expected if the null hypothesis were true. Failure to reject the null hypothesis (the forecast errors are zero) suggests that the model shows no evidence of structural instability over the forecast period.

An example of structural instability in revenue forecasting can be seen in certain forecasts of retail sales used to estimate sales tax revenue. Sales taxes are generally levied on only a subset of goods and services. Most services, such as medical care or legal services, are untaxed. If the percentage of income spent on taxable items is changing over time, this effect can introduce structural instability into sales tax revenue forecasts. The best way to deal with structural instability is to include a variable that accounts for the effect. In this case, the percent of personal income spent on goods (as opposed to services), can be calculated from economic data and most detailed economic forecasts. Including the percent of personal income spent on goods as an explanatory variable generally removes the structural instability from sales tax forecasting models.

Another source of structural instability might be long-run changes in the age distribution of regional population, which can affect sales and income tax revenues. Changes in vehicle ownership rates can affect forecasts of licensing or registration fees. All revenue sources can be impacted by shifts in the regional economic base. Forecasts are improved when these trends are identified and incorporated into the financial forecasts.