Improved Modification Method of Missing Data for Location-specific Detector
 Journal of Tongji University (Natural Science), 2019, Vol. 47, Issue 10: 1477-1484. DOI: 10.11908/j.issn.0253-374x.2019.10.013

### Cite this article

MIAO Xu, WANG Zhongyu, ZOU Yajie, WU Bing. Improved Modification Method of Missing Data for Location-specific Detector[J]. Journal of Tongji University (Natural Science), 2019, 47(10): 1477-1484. DOI: 10.11908/j.issn.0253-374x.2019.10.013


Improved Modification Method of Missing Data for Location-specific Detector
MIAO Xu 1, WANG Zhongyu 2, ZOU Yajie 1, WU Bing 1
1. Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, Shanghai 201804, China;
2. College of Transport and Communications, Shanghai Maritime University, Shanghai 201306, China
Abstract: Based on the temporal and spatial correlation of detector data, explanatory variables were dynamically selected for the data repair model, and an improved method for repairing missing data was proposed that jointly considers the periodic trend and real-time variability. The proposed method was assessed with data from location-specific detectors in Shanghai, China. Compared with the support vector regression (SVR) model, the mean absolute errors of the three detectors were reduced by 3.80%, 3.40%, and 25.23%, respectively, and the mean absolute percentage error was below 6% under the different data-missing conditions.
Key words: engineering of communications and transportation system; missing data modification; periodic pattern; support vector regression (SVR)

1 Data source

 Fig.1 Detectors on the north-south viaduct in Shanghai
2 Comprehensive data repair method

 $Y(t) = D(t) + R(t)$ (1)

2.1 Periodic analysis with the simple average method

 $\left\{ {\begin{array}{*{20}{l}} {{\mathit{\boldsymbol{Y}}_1} = \left( {{Y_1}(1),{Y_1}(2),{Y_1}(3), \cdots ,{Y_1}(n)} \right)}\\ {\; \vdots }\\ {{\mathit{\boldsymbol{Y}}_N} = \left( {{Y_N}(1),{Y_N}(2),{Y_N}(3), \cdots ,{Y_N}(n)} \right)} \end{array}} \right.$ (2)
 Fig.2 Periodic analysis of flow on weekdays

 $D\left( t \right) = \frac{1}{N}\sum\limits_{r = 1}^N {{Y_r}\left( t \right)}$ (3)
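Equations (1) and (3) can be illustrated with a minimal sketch, assuming each historical weekday is stored as an equal-length list of flow counts. The function names and the sample numbers are hypothetical and are not taken from the paper; the residual is simply Eq. (1) rearranged as R(t) = Y(t) - D(t).

```python
from statistics import fmean


def periodic_trend(days):
    """Eq. (3): D(t) is the mean of Y_r(t) over the N historical days."""
    n = len(days[0])
    if any(len(day) != n for day in days):
        raise ValueError("every day must have the same number of intervals")
    return [fmean(day[t] for day in days) for t in range(n)]


def residual(day, trend):
    """Eq. (1) rearranged: R(t) = Y(t) - D(t)."""
    return [y - d for y, d in zip(day, trend)]


# Hypothetical flow counts: N = 3 weekdays, n = 4 intervals per day.
history = [
    [100.0, 220.0, 310.0, 180.0],
    [110.0, 200.0, 290.0, 190.0],
    [90.0, 210.0, 300.0, 170.0],
]
trend = periodic_trend(history)      # periodic component D(t)
resid = residual(history[0], trend)  # real-time component R(t) of day 1
```

In this reading, the residual series rather than the raw flow would be the quantity passed on to the regression model, with D(t) added back after repair.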

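2.2 Support vector regression model with dynamically selected explanatory variables
2.2.1 Construction of candidate correlation sequences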

 ${R_a} = \frac{{\sum\limits_{j = 1}^n {\left( {S\left( j \right) - \bar S} \right)\left( {{S_a}(j) - \overline {{S_a}} } \right)} }}{{\sqrt {\sum\limits_{j = 1}^n {{{\left( {S\left( j \right) - \bar S} \right)}^2}} \sum\limits_{j = 1}^n {{{\left( {{S_a}(j) - \overline {{S_a}} } \right)}^2}} } }}$ (4)
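As a rough illustration of Eq. (4), the sketch below computes the correlation coefficient between a target sequence S and each candidate sequence S_a, then orders the candidates by it. The candidate names (`lag_1`, `upstream`) and the values are hypothetical, and ranking by the signed coefficient is an assumption; the paper's own selection rule is the one shown in its flow chart.

```python
from math import sqrt


def pearson(s, s_a):
    """Eq. (4): correlation coefficient R_a between S and candidate S_a."""
    n = len(s)
    if n != len(s_a) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_s = sum(s) / n
    mean_a = sum(s_a) / n
    cov = sum((x - mean_s) * (y - mean_a) for x, y in zip(s, s_a))
    var_s = sum((x - mean_s) ** 2 for x in s)
    var_a = sum((y - mean_a) ** 2 for y in s_a)
    if var_s == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_s * var_a)


# Hypothetical target sequence and two candidate correlation sequences.
target = [10.0, 20.0, 30.0, 40.0]
candidates = {
    "lag_1": [12.0, 19.0, 33.0, 41.0],     # e.g. a temporal neighbour
    "upstream": [40.0, 10.0, 30.0, 20.0],  # e.g. a spatial neighbour
}
# Assumption: candidates with larger R_a are preferred.
ranked = sorted(candidates, key=lambda name: pearson(target, candidates[name]),
                reverse=True)
```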

2.2.2 Dynamic selection of explanatory variables

 Fig.3 Flow chart of explanatory variable selection
2.2.3 Support vector regression model

 $f\left( x \right) = \left( {\mathit{\boldsymbol{w}} \cdot \mathit{\boldsymbol{\varphi }}\left( x \right)} \right) + b,\;\;\;\;\mathit{\boldsymbol{w}} \in P$ (5)

 $R\left( \mathit{\boldsymbol{w}} \right) = \frac{1}{2}{\left\| \mathit{\boldsymbol{w}} \right\|^2} + C\sum\limits_{i = 1}^l {{H_\varepsilon }\left( {Y\left( t \right),\hat Y\left( t \right)} \right)}$ (6)

 $\begin{array}{l} {H_\varepsilon }\left( {Y\left( t \right),\hat Y\left( t \right)} \right) = \\ \left\{ {\begin{array}{*{20}{l}} {\left| {Y\left( t \right) - \hat Y\left( t \right)} \right| - \varepsilon ,\;\;\;\;\left| {Y\left( t \right) - \hat Y\left( t \right)} \right| > \varepsilon }\\ {0,\;\;\;\;\;\;\text{otherwise}} \end{array}} \right. \end{array}$ (7)
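The ε-insensitive loss of Eq. (7) is short enough to state directly in code. The following is a minimal sketch; the function name and the sample values are illustrative only.

```python
def eps_insensitive_loss(y, y_hat, eps):
    """Eq. (7): errors inside the epsilon tube cost nothing;
    larger errors are charged only for the part beyond epsilon."""
    err = abs(y - y_hat)
    return err - eps if err > eps else 0.0


# Hypothetical observed flow of 100 vehicles and a tube width of 5.
inside_tube = eps_insensitive_loss(100.0, 103.0, 5.0)   # |error| = 3 <= 5
outside_tube = eps_insensitive_loss(100.0, 110.0, 5.0)  # |error| = 10 > 5
```

Small deviations therefore do not influence the fitted function, which is what makes the support vector solution sparse.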

 $\begin{array}{l} \min \frac{1}{2}{\mathit{\boldsymbol{w}}^{\rm{T}}}\mathit{\boldsymbol{w}} + C\sum\limits_{i = 1}^l {{\xi _i}} + C\sum\limits_{i = 1}^l {\xi _i^*} \\ {\rm{s}}.\;{\rm{t}}.\;\;\;{\mathit{\boldsymbol{w}}^{\rm{T}}}\mathit{\boldsymbol{\varphi }}\left( {{x_i}} \right) + b - Y\left( t \right) \le \varepsilon + {\xi _i}\\ \;\;\;\;\;\;\;Y\left( t \right) - {\mathit{\boldsymbol{w}}^{\rm{T}}}\mathit{\boldsymbol{\varphi }}\left( {{x_i}} \right) - b \le \varepsilon + \xi _i^*\\ \;\;\;\;\;\;\;{\xi _i},\xi _i^* \ge 0,\;\;\;\;i = 1, \cdots ,l \end{array}$ (8)

 $\mathit{\boldsymbol{w}} = \sum\limits_{i = 1}^l {\left( {{\alpha _i} - \alpha _i^ * } \right)\mathit{\boldsymbol{\varphi }}\left( {{x_i}} \right)}$ (9)

 $\sum\limits_{i = 1}^l {{\alpha _i}} = \sum\limits_{i = 1}^l {\alpha _i^*} ,\;\;\;\;0 \le {\alpha _i},\alpha _i^* \le C,i = 1,2, \cdots ,l$ (10)

 $\begin{array}{*{20}{c}} {f(x) = \sum\limits_{i = 1}^l {\left( {{\alpha _i} - \alpha _i^*} \right)K\left( {{x_i},x} \right) + b} }\\ {K\left( {{x_i},x} \right) = \mathit{\boldsymbol{\varphi }}\left( {{x_i}} \right) \cdot \mathit{\boldsymbol{\varphi }}(x)} \end{array}$ (11)
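To make Eq. (11) concrete, the sketch below evaluates the kernel expansion for given coefficients. It assumes the common RBF form K(x_i, x) = exp(-g‖x_i - x‖²), which is the LIBSVM convention and is not spelled out in the text reproduced here. The support vectors, coefficients, and offset are made-up numbers, not fitted values.

```python
from math import exp


def rbf_kernel(x_i, x, g):
    """Assumed RBF form: K(x_i, x) = exp(-g * ||x_i - x||^2)."""
    return exp(-g * sum((a - b) ** 2 for a, b in zip(x_i, x)))


def svr_predict(x, support_vectors, dual_coefs, b, g):
    """Eq. (11): f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.
    dual_coefs[i] holds the difference alpha_i - alpha_i^*."""
    return sum(c * rbf_kernel(sv, x, g)
               for sv, c in zip(support_vectors, dual_coefs)) + b


# Hypothetical model with two support vectors (values are illustrative).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [2.0, -1.0]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=1.0, g=0.5)
```

Only training points with a nonzero coefficient difference contribute to the sum, so prediction cost scales with the number of support vectors rather than with the size of the training set.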

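When constructing the ε-SVR, the constant C acts as the penalty coefficient that controls the size of the loss; it can be treated as a tuning parameter during model solving and affects the performance of the trained model. The value of the parameter g in the RBF kernel function also has a marked effect on the predictive performance of the model. During parameter setting, grid analysis and cross-validation are used to optimize the constant C in the support vector regression and the RBF kernel parameter g. The cross-validation procedure is as follows: the original data are divided evenly into 3 groups; each subset serves once as the validation set while the other 2 subsets form the training set, which yields 3 models; the average performance of these 3 models on their validation sets is used as the performance evaluation index. Grid analysis compares the prediction performance of the model under different parameters by programmatic enumeration. The case of a single missing data point is used here as an example to illustrate how the choice of the penalty coefficient C and the kernel parameter g affects the SVR model. The experiment uses the mean squared error (αMSE) as the evaluation index, calculated as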

 ${\alpha _{{\rm{MSE}}}} = \frac{1}{{{n_1}}}\sum\limits_{t = 1}^{{n_1}} {{{\left( {Y\left( t \right) - \hat Y\left( t \right)} \right)}^2}}$ (12)

 Fig.4 Influence of C and g on SVR model
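The parameter search described above (enumerating a grid of C and g values and scoring each pair by 3-fold cross-validation with the mean squared error of Eq. (12)) might be sketched as follows. To stay dependency-free, the example uses a hypothetical stand-in regressor, `toy_kernel_smoother`, which is not an SVR; in practice an ε-SVR solver would take its place. The interleaved fold assignment, the grid values, and the synthetic data are likewise assumptions.

```python
from itertools import product
from math import exp


def mse(y_true, y_pred):
    """Eq. (12): mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def fold_indices(n, k=3):
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_val_mse(fit_predict, xs, ys, params, k=3):
    """Each fold is the validation set once; return the mean validation MSE."""
    scores = []
    for fold in fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        preds = fit_predict([xs[i] for i in train], [ys[i] for i in train],
                            [xs[i] for i in fold], **params)
        scores.append(mse([ys[i] for i in fold], preds))
    return sum(scores) / k


def grid_search(fit_predict, xs, ys, grid, k=3):
    """Enumerate every parameter combination; keep the lowest CV error."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cross_val_mse(fit_predict, xs, ys, params, k), params))
    return min(results, key=lambda r: r[0])


def toy_kernel_smoother(train_x, train_y, test_x, C, g):
    """Hypothetical stand-in for an SVR (NOT an SVR): an RBF-weighted
    average that shrinks toward the training mean more strongly when C is small."""
    mean_y = sum(train_y) / len(train_y)
    preds = []
    for x in test_x:
        weights = [exp(-g * (xi - x) ** 2) for xi in train_x]
        num = sum(w * y for w, y in zip(weights, train_y)) + mean_y / C
        preds.append(num / (sum(weights) + 1.0 / C))
    return preds


# Synthetic one-dimensional data and an illustrative parameter grid.
xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
grid = {"C": [1.0, 10.0, 100.0], "g": [0.1, 1.0, 10.0]}
best_score, best_params = grid_search(toy_kernel_smoother, xs, ys, grid)
```

Because the observations form a time series, interleaved folds let neighbouring points appear in both training and validation sets; contiguous blocks would be a stricter alternative.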
3 Case study and result analysis

 ${\beta _{{\rm{MAE}}}} = \frac{1}{{{n_1}}}\sum\limits_{t = 1}^{{n_1}} {\left| {\hat Y\left( t \right) - Y\left( t \right)} \right|}$ (13)

 ${\gamma _{{\rm{MAPE}}}} = \frac{1}{{{n_1}}}\sum\limits_{t = 1}^{{n_1}} {\left| {\frac{{\hat Y\left( t \right) - Y\left( t \right)}}{{Y\left( t \right)}}} \right|}$ (14)

 ${\delta _{{\rm{RMSE}}}} = \sqrt {\frac{{\sum\limits_{t = 1}^{{n_1}} {{{\left( {Y\left( t \right) - \hat Y\left( t \right)} \right)}^2}} }}{{{n_1}}}}$ (15)
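The three evaluation indices above translate directly into code. This is a minimal sketch with made-up observed and repaired values; note that the MAPE is returned as a fraction, as in the formula, so it must be multiplied by 100 to compare with the percentages reported in the results.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error (beta_MAE)."""
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (gamma_MAPE), as a fraction."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an observed value is zero")
    return sum(abs((p - y) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error (delta_RMSE)."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical observed flows and repaired values for n_1 = 3 missing points.
observed = [100.0, 200.0, 400.0]
repaired = [110.0, 190.0, 400.0]
scores = (mae(observed, repaired), mape(observed, repaired),
          rmse(observed, repaired))
```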

(1) Dynamic selection of explanatory variables

 Fig.5 Correlation coefficients of correlation sequences

 Fig.6 Autocorrelation coefficients of detector data

(2) Support vector regression model

(3) Data repair results

 Fig.7 βMAE of 3 repeated experiments on No.5 detector

 Fig.8 βMAE of 6 models for different numbers of continuous missing data
 Fig.9 γMAPE of 6 models for different numbers of continuous missing data
 Fig.10 δRMSE of 6 models for different numbers of continuous missing data

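(1) Compared with the conventional SVR model, the SAM-DV-SVR model markedly improves the accuracy of missing data repair.

(2) For detector No. 3, the SAM-SVR model is clearly more accurate than the DV-SVR model, whereas detectors No. 4 and No. 5 show the opposite result. The reason is that the daily flow at detector No. 3 follows a more consistent periodic trend on weekdays, so the SAM-SVR model, which accounts for periodicity, can make full use of that periodicity when repairing missing data. In addition, for detector No. 3 the correlation coefficients of the temporal correlation sequences are clearly larger than those of the spatial correlation sequences; once the number of consecutive missing data points reaches 7, the DV-SVR model selects spatial correlation sequences for the repair, and the repair accuracy is noticeably lower. For detectors No. 4 and No. 5, the spatial correlation sequences are more strongly correlated than the temporal ones, so the DV-SVR model, with its dynamic variables, can select the strongly correlated spatial sequences as input variables and thereby improve repair accuracy.

(3) The SAM-DV-SVR model improves repair accuracy most for detector No. 5. Compared with the conventional SVR model, with 1 to 10 consecutive missing data points, the mean absolute error decreases by 25.23% on average, and the mean absolute percentage error stays below 5% in every case. The reason is that the flow data of detector No. 5 both follow a fairly consistent daily trend and are strongly correlated with the spatial correlation sequences of adjacent detectors. Relative to the conventional SVR model, the SAM-SVR model improves repair accuracy by accounting for periodicity, and the DV-SVR model, which selects explanatory variables dynamically, can still draw on strongly correlated spatial sequences when data are missing consecutively, which preserves repair accuracy. The SAM-DV-SVR model combines these two factors and therefore improves the repair accuracy for detector No. 5 substantially.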

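4 Conclusion

The SAM-DV-SVR model not only selects the best explanatory variables for the data repair model but also accounts for both the periodic trend and the real-time variation of traffic flow data. A comparison with several commonly used data repair models, with 1 to 10 consecutive missing data points, shows that the SAM-DV-SVR model achieves higher repair accuracy.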
