Anomaly Detection and Accuracy Measurement for Categorical Data


  • Kameron Grubaugh
  • Zachary Zimmerman
  • Nicholas McAfee
  • Emily McGowan
  • Paul Evangelista, United States Military Academy



The Department of Defense (DoD) recently initiated an effort to compile all inter-service maintenance data for equipment and infrastructure, requiring the consolidation of maintenance records from over 40 different data sources. This research evaluates and improves the accuracy of this maintenance data warehouse by means of value modeling and statistical methods for anomaly detection. The first step was to categorize error-identifying metadata, which was then consolidated into a weighted scoring model. The most novel aspect of the work was an error identification process based on conditional probability combinations and likelihood measures. The analysis showed promising results: conditional probability tests successfully identified numerous invalid maintenance description labels. This process has the potential both to reduce the manual labor needed to clean DoD maintenance data records and to provide better fidelity on DoD maintenance activities.
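To illustrate the general idea of a conditional probability test on categorical records, here is a minimal Python sketch. It is not the authors' implementation. The field names (equipment type, maintenance label), the toy records, and the 0.05 threshold are all assumptions made for illustration. The sketch estimates P(label | equipment type) from co-occurrence counts and flags label and equipment combinations whose estimated probability falls below the threshold.

```python
from collections import Counter


def flag_rare_labels(records, threshold=0.05):
    """Flag (equipment, label) pairs whose conditional probability is low.

    records: list of (equipment_type, label) tuples (hypothetical fields).
    threshold: assumed cutoff on the estimated P(label | equipment_type).
    Returns a list of (equipment_type, label, probability) tuples.
    """
    pair_counts = Counter(records)
    equipment_counts = Counter(equipment for equipment, _ in records)

    flagged = []
    for (equipment, label), n in pair_counts.items():
        # Empirical estimate of P(label | equipment_type).
        p = n / equipment_counts[equipment]
        if p < threshold:
            flagged.append((equipment, label, p))
    return flagged


# Toy data, invented for illustration only.
records = (
    [("truck", "oil change")] * 60
    + [("truck", "brake repair")] * 39
    + [("truck", "rotor blade inspection")] * 1
    + [("helicopter", "rotor blade inspection")] * 50
)

for equipment, label, p in flag_rare_labels(records):
    print(equipment, label, round(p, 3))
```

On this toy data, only the single "rotor blade inspection" record attached to a truck is flagged, because that label is rare for trucks even though it is common for helicopters. A real data warehouse would likely need tests over several attribute combinations and a threshold tuned against labeled examples.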

Author Biography

Paul Evangelista, United States Military Academy

Director, Engineering Management Program

Department of Systems Engineering,

United States Military Academy

Mahan Hall, Bldg 752, Room 420

West Point, NY 10996, USA





How to Cite

Grubaugh, K., Zimmerman, Z., McAfee, N., McGowan, E., & Evangelista, P. (2019). Anomaly Detection and Accuracy Measurement for Categorical Data. Industrial and Systems Engineering Review, 6(2), 88-94.
