
  • Soumana Betto Boubacar
  • Gaurav Gupta


Object Detection, Convolutional Neural Network, Vision Transformer.


Deep learning methods have given computers the power to imagine and create new things, something that was not possible just a few years ago. Owing to their powerful feature learning and transfer learning abilities, Convolutional Neural Networks (CNNs) have become the state of the art for the object detection task in computer vision. However, much current research focuses on new architectures based on attention, and the self-attention mechanism has been shown to be very useful in diverse application domains. In this context, Vision Transformers (ViTs) have been introduced and have demonstrated accuracy and runtime performance similar to those of CNN architectures in vision tasks. In this paper, we review CNN and ViT frameworks for object detection. We give a comprehensive description of their architectures, review some popular state-of-the-art models, and finally draw a comparison among those models.


Soumana Betto Boubacar, & Gaurav Gupta. (2023). A Survey of Convolutional Neural Networks and Vision Transformer Framework Based on Object Detection. Journal Punjab Academy of Sciences, 23, 338–351. Retrieved from https://jpas.in/index.php/home/article/view/88