MATE-YOLO: Multi-scale Attention Task-aligned Enhanced Detection Network for Apple Fruit in Complex Orchard Environments

Authors

  • Shuai Dong, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, China
  • Xiyin Liang, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, China

DOI:

https://doi.org/10.63313/JCSFT.9037

Keywords:

Apple detection, YOLOv11, attention mechanism, feature pyramid network, task-aligned detection head, deep learning

Abstract

Accurate apple detection in complex orchards remains challenging due to foliage occlusion, illumination variation, and cluttered backgrounds. This study proposes an enhanced YOLOv11n framework integrating three architectural innovations. First, the EMCSP (EMA-enhanced Cross-Stage Partial) module is introduced into the backbone, embedding efficient multi-scale attention within a cross-stage partial topology to strengthen discriminative feature extraction. Second, the ELA-HSFPN (Efficient Local Attention enhanced Hierarchical Scale Feature Pyramid Network) is devised for the neck, leveraging decoupled spatial attention and bidirectional hierarchical fusion to enrich multi-scale representation. Third, the TADDH (Task-Aligned Dynamic Detection Head) replaces the conventional head, employing task decomposition, dynamic deformable convolution, and probabilistic feature modulation to align classification with localization. Extensive experiments demonstrate substantial improvements over the YOLOv11n baseline: precision +1.4%, recall +2.3%, mAP@0.5 +3.0%, and mAP@0.5:0.95 +1.7%. These results validate the efficacy of the proposed method for intelligent fruit-harvesting applications.
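Two of the mechanisms summarized above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: `strip_pool_attention` shows the directional-pooling idea behind coordinate/EMA-style attention modules, and `task_alignment` is the TOOD-style metric t = s^α · u^β that task-aligned heads use to favor anchors strong in both classification and localization. The function names and default parameters are illustrative assumptions.

```python
import numpy as np

def strip_pool_attention(x):
    """Hypothetical sketch of directional (strip-pooling) attention:
    pool the feature map along H and along W, turn the pooled profiles
    into sigmoid gates, and reweight the input. x has shape (C, H, W)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    gate_h = sigmoid(x.mean(axis=2, keepdims=True))  # (C, H, 1): per-row gate
    gate_w = sigmoid(x.mean(axis=1, keepdims=True))  # (C, 1, W): per-column gate
    return x * gate_h * gate_w                       # broadcasts to (C, H, W)

def task_alignment(cls_score, iou, alpha=1.0, beta=6.0):
    """TOOD-style task-alignment metric t = s^alpha * u^beta, where s is
    the classification score and u the IoU of the predicted box."""
    return (cls_score ** alpha) * (iou ** beta)

x = np.random.default_rng(0).normal(size=(4, 8, 8))
y = strip_pool_attention(x)
print(y.shape)                              # (4, 8, 8)
print(round(task_alignment(0.9, 0.8), 4))  # 0.2359
```

Because both gates lie in (0, 1), the attention sketch only rescales responses; the alignment metric collapses toward zero whenever either the score or the IoU is poor, which is the property that drives classification-localization alignment.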

References

[1] FAO. FAOSTAT: Crops and livestock products. Food and Agriculture Organization of the United Nations, 2023. https://www.fao.org/faostat/

[2] Zhang, C., Kang, F., & Wang, Y. (2024). Review of apple picking robots: Current status and future perspectives. Computers and Electronics in Agriculture, 217, 108584.

[3] Wang, C.Y., Bochkovskiy, A., & Liao, H.Y.M. (2023). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7464-7475.

[4] Jocher, G., Qiu, J., & Chaurasia, A. (2024). Ultralytics YOLO11. https://github.com/ultralytics/ultralytics

[5] Ouyang, D., He, S., Zhang, G., Luo, M., Guo, H., Zhan, J., & Huang, Z. (2023). Efficient multi-scale attention module with cross-spatial learning. In ICASSP 2023-IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5.

[6] Tan, M., Pang, R., & Le, Q.V. (2020). EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10781-10790.

[7] Feng, C., Zhong, Y., Gao, Y., Scott, M.R., & Huang, W. (2021). TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3490-3499.

[8] Hou, Q., Zhou, D., & Feng, J. (2021). Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13713-13722.

[9] Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8759-8768.

[10] Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L., & Zhang, L. (2021). Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7373-7382.

[11] Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

[12] Wang, C.Y., Yeh, I.H., & Liao, H.Y.M. (2024). YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision (ECCV).

[13] Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., & Ding, G. (2024). YOLOv10: Real-time end-to-end object detection. In Advances in Neural Information Processing Systems (NeurIPS).

[14] Hou, Q., Zhou, D., & Feng, J. (2021). Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13713-13722.

[15] Ouyang, D., He, S., Zhang, G., Luo, M., Guo, H., Zhan, J., & Huang, Z. (2023). Efficient multi-scale attention module with cross-spatial learning. In ICASSP 2023-IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5.

[16] Wang, C., He, W., Nie, Y., Guo, J., Liu, C., Han, K., & Wang, Y. (2024). Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. In Advances in Neural Information Processing Systems (NeurIPS).

[17] Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., & Chen, J. (2024). DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Published

2026-01-16

Section

Articles

How to Cite

Dong, S., & Liang, X. (2026). MATE-YOLO: Multi-scale Attention Task-aligned Enhanced Detection Network for Apple Fruit in Complex Orchard Environments. Journal of Computer Science and Frontier Technologies, 2(2), 8-24. https://doi.org/10.63313/JCSFT.9037