A SURVEY OF CONVOLUTIONAL NEURAL NETWORKS AND VISION TRANSFORMER FRAMEWORK BASED ON OBJECT DETECTION
Keywords:
Object Detection, Convolutional Neural Network, Vision Transformer.
Abstract
Deep learning methods have given computers the ability to perceive and even create new content, something that was not possible just a few years ago. Owing to their powerful feature-learning and transfer-learning abilities, Convolutional Neural Networks (CNNs) have become the state of the art for the object detection task in computer vision. However, much recent research focuses on new architectures based on attention. The self-attention mechanism has proven useful across diverse application domains. In this context, the Vision Transformer (ViT) has been introduced and has demonstrated accuracy and runtime performance comparable to CNN architectures on vision tasks. In this paper, we review CNN- and ViT-based object detection frameworks. We describe their architectures in detail, review several popular state-of-the-art models, and finally draw a comparison among them.
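As a concrete illustration of the patch-embedding and self-attention ideas referred to above, the following is a minimal sketch in PyTorch (Paszke et al. 2019). It is not taken from any of the surveyed models; the module name, embedding size, and patch size are illustrative assumptions chosen to mirror the "image as a sequence of 16x16 patches" formulation of ViT (Dosovitskiy et al. 2021).

```python
# Minimal ViT-style block: split an image into 16x16 patches, embed each patch as a
# token, and let every token attend to every other token via multi-head self-attention.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192, num_heads=3):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each 16x16 patch and
        # applying a shared linear projection (the ViT patch embedding).
        self.patch_embed = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, 3, 224, 224) -> tokens: (batch, 196, 192)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        # Global self-attention: each patch token aggregates context from all others.
        attended, _ = self.attn(self.norm(tokens), self.norm(tokens), self.norm(tokens))
        return tokens + attended  # residual connection, as in a standard Transformer block

if __name__ == "__main__":
    out = TinyViTBlock()(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 196, 192])
```

A full ViT stacks several such blocks (each followed by an MLP) and, for detection, feeds the resulting token sequence to a task-specific head, as in DETR (Carion et al. 2020).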
References
Ball, John E., Derek T. Anderson, and Chee Seng Chan. 2017. “Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools, and Challenges for the Community.” Journal of Applied Remote Sensing 11 (04): 1. https://doi.org/10.1117/1.JRS.11.042609.
Bengio, Y., A. Courville, and P. Vincent. 2013. “Representation Learning: A Review and New Perspectives.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1798–1828. https://doi.org/10.1109/TPAMI.2013.50.
Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. “YOLOv4: Optimal Speed and Accuracy of Object Detection.” arXiv. http://arxiv.org/abs/2004.10934.
Bruna, Joan, Soumith Chintala, Yann LeCun, Serkan Piantino, Arthur Szlam, and Mark Tygert. 2015. “A Mathematical Motivation for Complex-Valued Convolutional Networks.” arXiv. http://arxiv.org/abs/1503.03438.
Cai, Zhaowei, and Nuno Vasconcelos. 2017. “Cascade R-CNN: Delving into High Quality Object Detection.” arXiv. http://arxiv.org/abs/1712.00726.
Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. “End-to-End Object Detection with Transformers.” arXiv. http://arxiv.org/abs/2005.12872.
Chaudhari, Sneha, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. 2021. “An Attentive Survey of Attention Models.” arXiv. http://arxiv.org/abs/1904.02874.
Chen, Chun-Fu, Quanfu Fan, and Rameswar Panda. 2021. “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.” arXiv. http://arxiv.org/abs/2103.14899.
Dahl, G. E., Dong Yu, Li Deng, and A. Acero. 2012. “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing 20 (1): 30–42. https://doi.org/10.1109/TASL.2011.2134090.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv. http://arxiv.org/abs/1810.04805.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv. http://arxiv.org/abs/2010.11929.
Girshick, Ross. 2015. “Fast R-CNN.” arXiv. http://arxiv.org/abs/1504.08083.
Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” arXiv. http://arxiv.org/abs/1311.2524.
Guo, Jianyuan, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. 2022. “CMT: Convolutional Neural Networks Meet Vision Transformers.” arXiv. http://arxiv.org/abs/2107.06263.
He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2018. “Mask R-CNN.” arXiv. http://arxiv.org/abs/1703.06870.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015a. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” arXiv. http://arxiv.org/abs/1502.01852.
———. 2015b. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9): 1904–16. https://doi.org/10.1109/TPAMI.2015.2389824.
———. 2015c. “Deep Residual Learning for Image Recognition.” arXiv. http://arxiv.org/abs/1512.03385.
Hinton, G. E., and R. R. Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural Networks.” Science 313 (5786): 504–7. https://doi.org/10.1126/science.1127647.
Hinton, Geoffrey, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine 29 (6): 82–97. https://doi.org/10.1109/MSP.2012.2205597.
Jiao, Licheng, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, and Rong Qu. 2019. “A Survey of Deep Learning-Based Object Detection.” IEEE Access 7: 128837–68. https://doi.org/10.1109/ACCESS.2019.2939201.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90. https://doi.org/10.1145/3065386.
Ledig, Christian, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al. 2017. “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network.” arXiv. http://arxiv.org/abs/1609.04802.
Li, Chuyi, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, et al. 2022. “YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications.” arXiv. http://arxiv.org/abs/2209.02976.
Li, Shasha, Yongjun Li, Yao Li, Mengjun Li, and Xiaorong Xu. 2021. “YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection.” IEEE Access 9.
Li, Yanghao, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. 2019. “Scale-Aware Trident Networks for Object Detection.” arXiv. http://arxiv.org/abs/1901.01892.
Lin, Kevin, Lijuan Wang, and Zicheng Liu. 2021. “End-to-End Human Pose and Mesh Reconstruction with Transformers.” arXiv. http://arxiv.org/abs/2012.09760.
Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. “Feature Pyramid Networks for Object Detection.” arXiv. http://arxiv.org/abs/1612.03144.
Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. “Focal Loss for Dense Object Detection.” arXiv. http://arxiv.org/abs/1708.02002.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. “Microsoft COCO: Common Objects in Context.” arXiv. http://arxiv.org/abs/1405.0312.
Liu, Li, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. 2020. “Deep Learning for Generic Object Detection: A Survey.” International Journal of Computer Vision 128 (2): 261–318. https://doi.org/10.1007/s11263-019-01247-4.
Liu, Shu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018. “Path Aggregation Network for Instance Segmentation.” arXiv. http://arxiv.org/abs/1803.01534.
Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. “SSD: Single Shot MultiBox Detector.” In Computer Vision – ECCV 2016, 9905:21–37. https://doi.org/10.1007/978-3-319-46448-0_2.
Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” arXiv. http://arxiv.org/abs/2103.14030.
Nisa, Sehar Un, and Muhammad Imran. 2019. “A Critical Review of Object Detection Using Convolution Neural Network.” In 2019 2nd International Conference on Communication, Computing and Digital Systems (C-CODE), 154–59. Islamabad, Pakistan: IEEE. https://doi.org/10.1109/C-CODE.2019.8681010.
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” arXiv. http://arxiv.org/abs/1912.01703.
Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” arXiv. http://arxiv.org/abs/1910.10683.
Rahman, Raian, Zadid Bin Azad, and Md Bakhtiar Hasan. 2022. “Densely-Populated Traffic Detection Using YOLOv5 and Non-Maximum Suppression Ensembling.” In , 95:567–78. https://doi.org/10.1007/978-981-16-6636-0_43.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. “You Only Look Once: Unified, Real-Time Object Detection.” arXiv. http://arxiv.org/abs/1506.02640.
Redmon, Joseph, and Ali Farhadi. 2016. “YOLO9000: Better, Faster, Stronger.” arXiv. http://arxiv.org/abs/1612.08242.
———. 2018. “YOLOv3: An Incremental Improvement.” arXiv. http://arxiv.org/abs/1804.02767.
Reichstein, Markus, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, and Prabhat. 2019. “Deep Learning and Process Understanding for Data-Driven Earth System Science.” Nature 566 (7743): 195–204. https://doi.org/10.1038/s41586-019-0912-1.
Reis, Dillon, Jordan Kupec, Jacqueline Hong, and Ahmad Daoudi. 2023. “Real-Time Flying Object Detection with YOLOv8.” arXiv. http://arxiv.org/abs/2305.09972.
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2017. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6): 1137–49. https://doi.org/10.1109/TPAMI.2016.2577031.
Rezatofighi, Hamid, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. “Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression.” arXiv. http://arxiv.org/abs/1902.09630.
Tan, Mingxing, and Quoc V. Le. 2020. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” arXiv. http://arxiv.org/abs/1905.11946.
Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. “Efficient Transformers: A Survey.” arXiv. http://arxiv.org/abs/2009.06732.
Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. “Training Data-Efficient Image Transformers & Distillation through Attention.” arXiv. http://arxiv.org/abs/2012.12877.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. “Attention Is All You Need.” arXiv. http://arxiv.org/abs/1706.03762.
Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2022. “YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors.” arXiv. http://arxiv.org/abs/2207.02696.
Wu, Xiongwei, Doyen Sahoo, and Steven C.H. Hoi. 2020. “Recent Advances in Deep Learning for Object Detection.” Neurocomputing 396 (July): 39–64. https://doi.org/10.1016/j.neucom.2020.01.085.
Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. “Aggregated Residual Transformations for Deep Neural Networks.” arXiv. http://arxiv.org/abs/1611.05431.
Yang, Baosong, Longyue Wang, Derek Wong, Lidia S. Chao, and Zhaopeng Tu. 2019. “Convolutional Self-Attention Networks.” arXiv. http://arxiv.org/abs/1904.03107.
Yang, Michael. 2022. “Visual Transformer for Object Detection.” arXiv. http://arxiv.org/abs/2206.06323.
Yuan, Li, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet.” arXiv. http://arxiv.org/abs/2101.11986.
Zhang, Liangpei, Lefei Zhang, and Bo Du. 2016. “Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art.” IEEE Geoscience and Remote Sensing Magazine 4 (2): 22–40. https://doi.org/10.1109/MGRS.2016.2540798.
Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. 2019. “Objects as Points.” arXiv. http://arxiv.org/abs/1904.07850.
Zhu, Xingkui, Shuchang Lyu, Xu Wang, and Qi Zhao. 2021. “TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios.” arXiv. http://arxiv.org/abs/2108.11539.
Zintgraf, Luisa M., Taco S. Cohen, and Max Welling. 2017. “A New Method to Visualize Deep Neural Networks.” arXiv. http://arxiv.org/abs/1603.02518.