Abstract

Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering both efficiency and accuracy, we proposed a practical ultra lightweight OCR system (PP-OCR) and an optimized version, PP-OCRv2. In order to further improve the performance of PP-OCRv2, a more robust OCR system, PP-OCRv3, is proposed in this paper. PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. For the text detector, we introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and a DML distillation strategy. For the text recognizer, the base model is replaced from CRNN to SVTR, and we introduce the lightweight text recognition network SVTR_LCNet, guided training of CTC by attention, the data augmentation strategy TextConAug, a better pre-trained model obtained by the self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve its performance. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than that of PP-OCRv2 under comparable inference speed. All the above mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR), which is powered by PaddlePaddle (https://github.com/PaddlePaddle).


Baidu Inc.

OCR (Optical Character Recognition) in the wild, as shown in Figure 1, has various application scenarios, such as document digitization, identity authentication, digital financial systems, and vehicle license plate recognition. In recent years, researchers have conducted in-depth research on the sub-problems of text detection and text recognition in OCR. Many effective algorithms have been proposed, such as DB Liao et al. (2020) for text detection and CRNN Shi et al. (2016) for text recognition. By connecting the detection model and the recognition model, a common two-stage OCR system can be obtained. In practical industrial applications, OCR often needs to be deployed in various software and hardware environments, where storage space or computing resources are often limited, such as in a deployment on a mobile phone. Therefore, when we build an OCR system in practice, not only accuracy but also computational efficiency must be considered.

Previously, we proposed a practical ultra lightweight OCR system (PP-OCR) Du et al. (2020) to balance accuracy against efficiency, and PP-OCRv2 Du et al. (2021) to further improve the accuracy without increasing prediction cost. The hmean of PP-OCRv2 is 7% higher than that of PP-OCR under the same inference cost and is comparable to the PP-OCR server models, which use the ResNet series as backbones. However, there are still some bad cases to be optimized, such as missed detection of single words and misrecognition problems, as shown in Figure 11 and Figure 12. In this paper, we propose PP-OCRv3, which is a more robust OCR system that can better handle the aforementioned problems.

Figure 2: The framework of the proposed PP-OCRv3. The strategies in the green boxes are the same as PP-OCRv2. The strategies in the pink boxes are the newly added ones in the PP-OCRv3. The strategies in the gray boxes are adopted by the PP-OCRv3-tiny.

Figure 2 illustrates the framework of PP-OCRv3. PP-OCRv3 is further upgraded on the basis of PP-OCRv2. The overall framework of PP-OCRv3 is the same as that of PP-OCRv2, which consists of three parts: text detection, detected boxes rectification, and text recognition. In PP-OCRv3, the text detection model and the text recognition model are each further optimized. Specifically, the detection network is still optimized based on DB Liao et al. (2020), while the base model of the recognition network is replaced from CRNN Shi et al. (2016) to SVTR Du et al. (2022).

Most strategies follow PP-OCR and PP-OCRv2 as shown in the green boxes. The strategies in the pink boxes are the additional ones in PP-OCRv3.

For the text detector, we introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and the DML Zhang et al. (2017) distillation strategy. LK-PAN and DML are used to improve the performance of the teacher model, while RSE-FPN is integrated into the student network. With a better-performing teacher model and an optimized student network, a better detection model can be trained with Collaborative Mutual Learning (CML) Du et al. (2021).

For the text recognizer, we introduce the lightweight text recognition network SVTR_LCNet, guided training of CTC by attention, the data augmentation strategy TextConAug, a better pre-trained model obtained by the self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve its performance. SVTR_LCNet is a newly designed lightweight text recognition network that combines the transformer-based algorithm SVTR Du et al. (2022) with the convolution-based algorithm PP-LCNet Cui et al. (2021), which was used as the backbone of the PP-OCRv2 recognizer, so as to combine their advantages in accuracy and speed. The other five strategies, namely guided training of CTC by attention, the data augmentation strategy TextConAug, the better pre-trained model from the self-supervised TextRotNet, UDML, and UIM, are introduced to improve the accuracy without increasing any prediction cost.

Besides, the strategies in the gray boxes of Figure 2 are adopted to speed up the inference in PP-OCRv3-tiny.

We conduct a series of ablation experiments to verify the effectiveness of the above strategies. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than that of PP-OCRv2 under comparable inference speed.

The rest of the paper is organized as follows. In section 2, we present the details of the newly proposed improvement strategies. Experimental results are discussed in section 3 and conclusions are drawn in section 4.

The PP-OCRv3 detection model upgrades the CML (Collaborative Mutual Learning) distillation strategy proposed in PP-OCRv2. As shown in Figure 3, the main idea of CML is to combine the traditional distillation strategy of a teacher guiding students with the DML strategy, which allows the student networks to learn from each other. PP-OCRv3 further optimizes the teacher model and the student model, respectively. For the teacher model, a PAN module with a large receptive field named LK-PAN is proposed and the DML distillation strategy is adopted; for the student model, an FPN module with a residual attention mechanism named RSE-FPN is proposed.

Figure 3: The CML framework and training process of PP-OCRv3 detection model.

Large Kernel PAN (LK-PAN)

LK-PAN (Large Kernel PAN) is a lightweight PAN Liu et al. (2018) structure with a larger receptive field. The main idea is to change the convolution kernel size in the path augmentation of the PAN structure from 3×3 to 9×9. By increasing the convolution kernel size, the receptive field of each position of the feature map is improved, making it easier to detect text in large fonts and text with extreme aspect ratios. Using LK-PAN, the hmean of the teacher model can be improved from 83.2% to 85.0%. The schematic diagram of LK-PAN in PP-OCRv3 is shown in Figure 4.

Figure 4: The schematic diagram of LK-PAN.
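To make the LK-PAN idea concrete, below is a minimal Python sketch (not the released PaddleOCR implementation) of one path-augmentation fusion step that uses a 9×9 convolution instead of the usual 3×3; the layer names and channel sizes are illustrative assumptions.

```python
# A minimal sketch of enlarging the convolution kernel in the PAN
# path-augmentation branch, as described above. Names are illustrative.
import paddle
import paddle.nn as nn


class LargeKernelDownFusion(nn.Layer):
    """Fuses a higher-resolution feature map into the next PAN level
    using a large (9x9) convolution instead of the usual 3x3."""

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        self.down_conv = nn.Conv2D(
            channels, channels, kernel_size=3, stride=2, padding=1)
        # A larger kernel means a larger receptive field, which helps with
        # large-font text and text with extreme aspect ratios.
        self.fuse_conv = nn.Conv2D(
            channels, channels, kernel_size=kernel_size,
            padding=kernel_size // 2)

    def forward(self, finer, coarser):
        # Downsample the finer level and fuse it with the coarser level.
        x = self.down_conv(finer) + coarser
        return self.fuse_conv(x)


if __name__ == "__main__":
    finer = paddle.rand([1, 96, 80, 80])
    coarser = paddle.rand([1, 96, 40, 40])
    print(LargeKernelDownFusion(96)(finer, coarser).shape)  # [1, 96, 40, 40]
```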

Deep Mutual Learning (DML)

DML (Deep Mutual Learning) Zhang et al. (2018) can effectively improve the accuracy of the text detection model by letting two models with the same structure learn from each other. The DML strategy is adopted in the teacher model training, and the hmean is increased from 85% to 86%. By updating the teacher model of CML in PP-OCRv2 to the above-mentioned higher-precision one, the hmean of the student model can be further improved from 83.2% to 84.3%. The schematic diagram of DML in PP-OCRv3 is shown in Figure 5.

Figure 5: The schematic diagram of DML.
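The following is a minimal sketch of how a DML-style objective for two same-structure models could be written; the loss weight and the use of detached peer predictions are assumptions, not the exact PP-OCRv3 formulation.

```python
# A minimal sketch of DML (Deep Mutual Learning) for two models of the same
# structure: each model is trained with its task loss plus a KL term that
# pulls its prediction towards the peer's.
import paddle.nn.functional as F


def dml_loss(logits_a, logits_b, task_loss_a, task_loss_b, kl_weight=1.0):
    """Return the two DML losses; gradients of each KL term flow only
    into the corresponding model (the peer is detached)."""
    log_p_a = F.log_softmax(logits_a, axis=1)
    log_p_b = F.log_softmax(logits_b, axis=1)
    p_a = F.softmax(logits_a.detach(), axis=1)
    p_b = F.softmax(logits_b.detach(), axis=1)

    kl_a = F.kl_div(log_p_a, p_b, reduction="mean")  # model A learns from B
    kl_b = F.kl_div(log_p_b, p_a, reduction="mean")  # model B learns from A

    loss_a = task_loss_a + kl_weight * kl_a
    loss_b = task_loss_b + kl_weight * kl_b
    return loss_a, loss_b
```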

Residual Squeeze-and-Excitation FPN (RSE-FPN)

RSE-FPN (Residual Squeeze-and-Excitation FPN) introduces a residual attention mechanism by replacing the convolutional layers in the FPN with RSEConv, to improve the representation ability of the feature map. RSEConv consists of two parts, a residual structure and a Squeeze-and-Excitation (SE) block Hu et al. (2018), as shown in Figure 6. The number of channels of the lightweight FPN of PP-OCRv2 is relatively small, and the Squeeze-and-Excitation module may suppress some channels containing important features. The introduction of the residual structure in RSEConv can alleviate this problem and improve text detection performance. By updating the FPN structure of the student model of CML to RSE-FPN, the hmean of the student model can be further improved from 84.3% to 85.4%.

Figure 6: The schematic diagram of RSE-FPN.
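Below is a minimal sketch of an RSEConv-style block as described above: a convolution whose output is re-weighted by an SE branch, with a residual shortcut so that informative channels are not overly suppressed. The 1×1 convolution and reduction ratio are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of an RSEConv-style block: convolution + SE attention,
# with a residual shortcut around the SE re-weighting.
import paddle.nn as nn


class RSEConv(nn.Layer):
    def __init__(self, in_channels: int, out_channels: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2D(in_channels, out_channels, kernel_size=1)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2D(1),
            nn.Conv2D(out_channels, out_channels // reduction, kernel_size=1),
            nn.ReLU(),
            nn.Conv2D(out_channels // reduction, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        # Residual shortcut: keep the original response alongside the
        # SE-re-weighted one, so important channels are not suppressed.
        return y + y * self.se(y)
```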

The recognition module of PP-OCRv3 is optimized based on the text recognition algorithm SVTR Du et al. (2022). SVTR no longer adopts an RNN (Recurrent Neural Network); instead, it introduces a transformer structure, which can mine the context information of text line images more effectively and thus improve text recognition ability. To make SVTR a more practical model, we adopt six strategies to optimize and accelerate it, as shown in Figure 7.

Figure 7: The framework and training process of PP-OCRv3 recognition model.

Lightweight Text Recognition Network SVTR_LCNet

SVTR_LCNet is a lightweight text recognition network that fuses the transformer-based network SVTR Du et al. (2022) with the lightweight CNN-based network PP-LCNet Cui et al. (2021). Specifically, we adopted a tiny version of SVTR, named SVTR_Tiny. However, SVTR_Tiny is 10 times slower than the CRNN-based recognizer of PP-OCRv2 on CPU with MKLDNN enabled, due to the limited model structures supported by the MKLDNN acceleration library, which is not practical enough. As shown in Figure 8, the main structure of SVTR_Tiny is the Mix Block, which our analysis shows to be the most time-consuming module, so we optimize the structure in three steps to speed up the model while preserving its accuracy, as shown in Figure 9.

Figure 8: The framework of the SVTR_Tiny.

Firstly, considering the high performance of PP-LCNet, we replace the first half of the SVTR_Tiny network with the first three stages of PP-LCNet and retain only 4 Global Mix Blocks. Secondly, we reduce the number of Global Mix Blocks from 4 to 2. Thirdly, as we found that the prediction speed of the Global Mix Block is related to the shape of the input features, the Global Mix Blocks are moved behind the pooling layer. Finally, we get SVTR_LCNet, which is shown in Figure 9 (c).
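As a rough illustration of the final structure, the sketch below composes CNN stages, a height-collapsing pooling layer, two global mixing blocks placed after pooling, and a CTC head. The block implementations are placeholders under these assumptions and do not reproduce the released SVTR_LCNet code.

```python
# A structural sketch of an SVTR_LCNet-like recognizer: CNN stages, pooling,
# two global mixing (self-attention) blocks after pooling, and a CTC head.
import paddle.nn as nn


class GlobalMixBlock(nn.Layer):
    """Placeholder transformer-style block operating on a flattened sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiHeadAttention(dim, num_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                # x: [N, T, dim]
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


class SVTRLCNetSketch(nn.Layer):
    def __init__(self, cnn_stages: nn.Layer, dim: int, num_classes: int):
        super().__init__()
        self.cnn_stages = cnn_stages     # first three PP-LCNet-like stages
        self.pool = nn.AdaptiveAvgPool2D((1, None))   # collapse the height
        self.mix_blocks = nn.Sequential(GlobalMixBlock(dim), GlobalMixBlock(dim))
        self.ctc_head = nn.Linear(dim, num_classes)

    def forward(self, x):                # x: [N, C, H, W] text-line image
        feat = self.cnn_stages(x)        # [N, dim, H', W']
        feat = self.pool(feat)           # [N, dim, 1, W']
        seq = feat.squeeze(2).transpose([0, 2, 1])    # [N, W', dim]
        seq = self.mix_blocks(seq)       # global mixing after pooling
        return self.ctc_head(seq)        # per-timestep logits for CTC
```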

Guided Training of CTC by Attention (GTC)

Connectionist Temporal Classification (CTC) and the attention mechanism are the two main approaches used in recent scene text recognition works. Compared with attention-based methods, a CTC decoder achieves a much faster inference speed, but a lower accuracy. To obtain an efficient and effective model, we adopt the GTC Hu et al. (2020) method. We use an attention module to guide the training of CTC and to fuse multiple features, which is an effective strategy to improve text recognition accuracy. No additional inference time is introduced, as the attention module is completely removed during prediction.
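A minimal sketch of a GTC-style joint objective is shown below: the CTC branch and an auxiliary attention branch are trained together, and only the CTC branch is kept at inference. The loss weight, blank index, and tensor layouts are assumptions.

```python
# A minimal sketch of GTC-style training: an attention decoder branch guides
# the shared features during training; only the CTC branch is used at inference.
import paddle.nn.functional as F


def gtc_loss(ctc_logits, ctc_labels, input_lengths, label_lengths,
             attn_logits, attn_labels, attn_weight=1.0):
    # CTC branch: logits shaped [T, N, num_classes] with blank id 0 (assumed).
    ctc = F.ctc_loss(ctc_logits, ctc_labels, input_lengths, label_lengths,
                     blank=0, reduction="mean")
    # Attention branch: per-step character classification.
    attn = F.cross_entropy(
        attn_logits.reshape([-1, attn_logits.shape[-1]]),
        attn_labels.reshape([-1]))
    return ctc + attn_weight * attn
```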

Mining Text Context Information (TextConAug)

TextConAug is a data augmentation strategy for mining textual context information. The main idea comes from the paper ConCLR Zhang et al. (2022), in which the authors propose the data augmentation strategy ConAug, which concatenates two different images in a batch to form new images for self-supervised contrastive learning. We apply this method to supervised learning tasks and design the TextConAug data augmentation method, which can enrich the context information of training data and improve its diversity.
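The following sketch illustrates the TextConAug idea on a batch of labeled text-line crops: pairs of images are concatenated horizontally and their labels joined. The resizing policy and concatenation probability are assumptions, not the exact released configuration.

```python
# A minimal sketch of TextConAug-style augmentation: concatenate two text-line
# images from the same batch and join their labels.
import random
import cv2
import numpy as np


def text_con_aug(images, labels, prob=0.5):
    """images: list of HxWx3 uint8 arrays; labels: list of strings."""
    out_images, out_labels = [], []
    for img, label in zip(images, labels):
        if random.random() < prob:
            j = random.randrange(len(images))
            other, other_label = images[j], labels[j]
            h = img.shape[0]
            # Match heights before horizontal concatenation.
            scale = h / other.shape[0]
            other = cv2.resize(other, (max(1, int(other.shape[1] * scale)), h))
            img = np.concatenate([img, other], axis=1)
            label = label + other_label
        out_images.append(img)
        out_labels.append(label)
    return out_images, out_labels
```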

Self-Supervised Pre-trained Model (TextRotNet)

TextRotNet is a pre-trained model trained with a large amount of unlabeled text line data in a self-supervised manner, following the paper STR-Fewer-Labels Baek et al. (2021). This model is used to initialize the weights of SVTR_LCNet, which helps the text recognition model converge better.
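As an illustration of rotation-based self-supervised pre-training in this spirit, the sketch below builds (rotated image, rotation class) pairs from unlabeled text-line images; the set of angles is an assumption.

```python
# A minimal sketch of building rotation-prediction samples for self-supervised
# pre-training from unlabeled text-line images.
import random
import numpy as np

ANGLES = [0, 90, 180, 270]  # assumed rotation classes


def make_rotation_sample(img):
    """img: HxWxC array -> (rotated image, rotation class id)."""
    k = random.randrange(len(ANGLES))
    rotated = np.rot90(img, k=k).copy()
    return rotated, k
```

A backbone plus a small classification head would be trained on such pairs; afterwards the head is discarded and the backbone weights are used to initialize SVTR_LCNet.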

Unified-Deep Mutual Learning (U-DML)

UDML is a strategy proposed in PP-OCRv2 that is very effective for improving model accuracy. In PP-OCRv3, for the two different structures SVTR_LCNet and the attention module, the feature map of PP-LCNet, the output of the SVTR module, and the output of the attention module are simultaneously supervised and trained between the two networks.
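A minimal sketch of such mutual supervision between two identically structured recognition branches is given below; which tensors are matched and the loss weights are assumptions rather than the exact PP-OCRv3 configuration.

```python
# A minimal sketch of U-DML-style supervision between two students: the task
# losses, a feature-level alignment term, and a symmetric prediction-level
# mutual-learning term are combined into one objective.
import paddle.nn.functional as F


def udml_loss(task_loss_a, task_loss_b,
              feat_a, feat_b, logits_a, logits_b,
              feat_weight=1.0, kl_weight=1.0):
    # Feature-level alignment between the two students.
    feat_term = F.mse_loss(feat_a, feat_b)
    # Prediction-level mutual learning (KL against detached peers).
    kl_ab = F.kl_div(F.log_softmax(logits_a, axis=-1),
                     F.softmax(logits_b.detach(), axis=-1), reduction="mean")
    kl_ba = F.kl_div(F.log_softmax(logits_b, axis=-1),
                     F.softmax(logits_a.detach(), axis=-1), reduction="mean")
    return (task_loss_a + task_loss_b
            + feat_weight * feat_term
            + kl_weight * (kl_ab + kl_ba))
```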

Unlabeled Images Mining (UIM)

UIM is a very simple unlabeled data mining strategy. The main idea is to use a high-precision text recognition model to predict unlabeled images to obtain pseudo-labels, and select samples with high prediction confidence as training data for training lightweight models.
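The sketch below captures the UIM procedure: predict unlabeled crops with a high-accuracy recognizer and keep only confident predictions as pseudo-labeled training samples. The predictor interface and the confidence threshold are assumptions.

```python
# A minimal sketch of UIM-style pseudo-label mining from unlabeled images.
def mine_pseudo_labels(unlabeled_images, predict_fn, conf_threshold=0.95):
    """predict_fn(image) -> (text, confidence); returns (image, text) pairs."""
    mined = []
    for img in unlabeled_images:
        text, conf = predict_fn(img)
        if text and conf >= conf_threshold:
            mined.append((img, text))
    return mined
```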

We perform experiments on the datasets as shown in Table 1 , which is expanded on the basis of what we used in our previous work PP-OCR Du et al. ( 2020 ) and PP-OCRv2 Du et al. ( 2021 ) .

For text detection, there are 127k training images and 200 validation images. The training images consist of 68K real scene images and 59K synthetic images. The real scene images are collected from Baidu image search and public datasets, including LSVT Sun et al. (2019), RCTW-17 Shi et al. (2017), MTWI 2018 He and Yang (2018), CASIA-10K He et al. (2018), SROIE Huang et al. (2019), MLT 2019 Nayef et al. (2019), BDI Karatzas et al. (2011), MSRA-TD500 Yao et al. (2012) and CCPD 2019 Xu et al. (2018). The synthetic images mainly focus on scenarios with long texts, multi-directional texts and texts in tables. The validation images are all from real scenes.

For text recognition, there are 18.5M training images and 18.7K validation images. Among the training images, 7M images are real scene images, which come from some public datasets and Baidu image search. The public datasets include LSVT, RCTW-17, MTWI 2018, CCPD 2019, openimages (https://github.com/openimages/dataset) and InvoiceDatasets (https://github.com/FuxiJia/InvoiceDatasets). Besides, we scraped 750k financial report images from the web. We obtain 810k images from LSVT unlabeled data by using the UIM strategy. We also obtain about 3M cropped images from PubTabNet (https://github.com/ibm-aur-nlp/PubTabNet). The remaining 11.5M synthetic images mainly focus on scenarios with different backgrounds, rotation, perspective transformation, noise, vertical text, etc. The corpus of the synthetic images comes from the real scene images. All the validation images also come from real scenes.

In addition, we collect 800 images from different real application scenarios to evaluate the overall OCR system, including contract samples, license plates, nameplates, train tickets, test sheets, forms, certificates, street view images, business cards, digital meters, etc. Figure 10 shows some images of the test set.

Figure 10: Some images in the end-to-end test set.

The data synthesis tool used in text detection and text recognition is modified from text_renderer Sanster (2018).

Implementation Details

We adopt most of the strategies used in PP-OCRv2, as shown in Figure 2. We use the Adam optimizer to train all the models, setting the initial learning rate to 0.001. The difference is that we adopt cosine learning rate decay as the learning rate schedule for training the detection model, but piece-wise decay for training the recognition model. Besides, we use a weight decay of 3e-5 for the recognition model, but 1e-5 for the CTC head. For detection model training, we use a weight decay of 5e-5. Warm-up training for a few epochs at the beginning is utilized for both detection and recognition model training.

For text detection, the model is trained for 500 epochs in total with warm-up training for 2 epochs. The batch size is set to 8 per card. For text recognition, the model warms up for 5 epochs, is then trained for 700 epochs with the initial learning rate 0.001, and is finally trained for 100 epochs with the learning rate decayed to 0.0001. The batch size is 128 per card.
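For illustration, the snippet below wires up the schedules described above with Paddle's optimizer API (cosine decay with warm-up for detection, piece-wise decay for recognition); the stand-in models and the per-epoch stepping of the schedulers are assumptions, not the released training configuration.

```python
# A sketch of the training schedules described above using Paddle's optimizer
# API. Schedulers are assumed to be stepped once per epoch.
import paddle
import paddle.nn as nn

# Stand-in modules so the snippet runs; replace with the real detector/recognizer.
det_model = nn.Conv2D(3, 1, kernel_size=3, padding=1)
rec_model = nn.Linear(64, 64)

# Detection: cosine decay over 500 epochs with a 2-epoch warm-up, weight decay 5e-5.
det_sched = paddle.optimizer.lr.LinearWarmup(
    learning_rate=paddle.optimizer.lr.CosineAnnealingDecay(
        learning_rate=0.001, T_max=500),
    warmup_steps=2, start_lr=0.0, end_lr=0.001)
det_opt = paddle.optimizer.Adam(learning_rate=det_sched, weight_decay=5e-5,
                                parameters=det_model.parameters())

# Recognition: 0.001 for 700 epochs, then 0.0001 for 100 epochs (piece-wise
# decay), with a 5-epoch warm-up; weight decay 3e-5 (the 1e-5 for the CTC head
# would be configured per parameter group in practice).
rec_sched = paddle.optimizer.lr.LinearWarmup(
    learning_rate=paddle.optimizer.lr.PiecewiseDecay(
        boundaries=[700], values=[0.001, 0.0001]),
    warmup_steps=5, start_lr=0.0, end_lr=0.001)
rec_opt = paddle.optimizer.Adam(learning_rate=rec_sched, weight_decay=3e-5,
                                parameters=rec_model.parameters())
```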

In the inference period, Hmean is used to evaluate the performance of the text detector and the end-to-end OCR system. Sentence accuracy is used to evaluate the performance of the text recognizer. GPU inference time is tested on a single T4 GPU. CPU inference time is tested on an Intel(R) Xeon(R) Gold 6148.
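For reference, the Hmean reported here is the harmonic mean of detection precision and recall:

```latex
\mathrm{Hmean} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```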

The PP-OCRv3 text detector adopts the CML distillation strategy, which involves a teacher model and two student models, as shown in Figure 3. We first optimize the networks of the teacher model and the student models respectively, and then use the optimized teacher model to guide the training of the student models. For the teacher model, LK-PAN is integrated and DML is adopted to further improve the effect. For the student models, RSE-FPN is integrated and the training is guided by the optimized teacher.

The ablation study is shown in Table 2. The table can be divided into two parts by the double horizontal lines. The upper part shows the experimental results of the teacher model, and the lower part shows the experimental results of the student model. The teacher model aims only at accuracy and does not consider efficiency, while the student model needs to consider both.

Table 2: Ablation study of enhancement strategies for text detection. DB-R50 means DB text detection model with ResNet50 backbone. DB-R50-LK-PAN replaces DB-R50’s FPN module with LK-PAN. DB-R50-LK-PAN-DML is DB-R50-LK-PAN trained using the DML method. DB-MV3 means DB model with MobileNetV3 backbone. DB-MV3-RSE-FPN replaces DB-MV3’s FPN module with RSE-FPN. DB-MV3-CML is distilled from DB-R50-LK-PAN-DML. DB-MV3-RSE-FPN-CML is distilled from DB-R50-LK-PAN-DML with RSE-FPN integrated.

The baseline of the teacher model is a DB model with a backbone of ResNet50, named DB-R50. It can be found that Hmean can be improved from 83.5% to 85.0% by using LK-PAN, with the inference time cost increasing from 260ms to 396ms. Using DML distillation, Hmean of teacher model can be further improved to 86.0%.

For the student model, comparing the two experiments marked with *, which denote experiments without the CML distillation method, we can find that the Hmean can be improved from 81.3% to 84.5% with RSE-FPN, with the inference time increasing by only 6%. Student1* is equivalent to the PP-OCR mobile detector, while the PP-OCRv2 detector is the upgraded version of the PP-OCR mobile detector obtained by using CML.

Furthermore, we verify the effectiveness of the combination of the optimized teacher and student models in CML. The DB-MV3-CML model is trained using the CML method, guided by the teacher model DB-R50-LK-PAN-DML. The Hmean is improved from 83.2% to 84.3%. If we introduce RSE-FPN in the student model of DB-MV3-CML, the Hmean can be improved from 84.3% to 85.4%. Finally, we adopt DB-MV3-RSE-FPN-CML as the text detection model for PP-OCRv3. The visualization comparison of PP-OCRv2 and PP-OCRv3 text detection model is shown in Figure 11 .

Table 3 shows the ablation study of SVTR_LCNet. We choose PP-OCRv2-baseline as our baseline, which uses PP-LCNet as the backbone, a BiLSTM with hidden size 48 and a CTC decoder, but without U-DML. Comparing SVTR_Tiny with PP-OCRv2-baseline, the accuracy is improved by 10.8%, while the prediction speed is nearly 11 times slower. After replacing the first half of the SVTR_Tiny network with the first three stages of PP-LCNet and retaining 4 Global Mix Blocks, the accuracy is 76% and the speedup is 69%. Then we further reduce the number of Global Mix Blocks from 4 to 2; the accuracy is 72.9% and the speedup is 69%. After moving the Global Mix Blocks behind the pooling layer, the accuracy drops to 71.9%, and the speed surpasses the CNN-based PP-OCRv2-baseline by 22%. In addition, the height of the input image is further increased from 32 to 48, which makes the prediction speed slightly slower but greatly improves the model accuracy. The recognition accuracy reaches 73.98%, which is close to the accuracy of the PP-OCRv2 recognizer trained with the distillation strategy.

Table 3: Ablation study of SVTR_LCNet. G4 means 4 Global Mix Blocks are used, G2 means 2 Global Mix Blocks are used, h32 means the height of the input image is 32 pixels, h48 means the height of the input image is 48 pixels. The speed is tested on CPU.

Table 4 shows the ablation study of the text recognition optimization strategies of the PP-OCRv3 recognizer. Comparing SVTR_LCNet with PP-OCRv2, the accuracy of SVTR_LCNet is close to the accuracy of the PP-OCRv2 recognizer trained with the distillation strategy, and the speed is 11% faster. The GTC method improves the accuracy by 1.82%, and no additional inference time is introduced, as the attention module is completely removed during prediction. By using TextConAug, the accuracy is further improved by 0.5%. The TextRotNet method improves the accuracy by another 0.6%. Furthermore, the accuracy can be improved by 1.5% by using U-DML, which is a significant improvement. By using UIM to mine unlabeled data, the accuracy can be improved by another 1.0%. Figure 12 shows some examples tested by the PP-OCRv3 and PP-OCRv2 recognizers.

Table 4: Ablation study of PP-OCRv3 recognition. + means a new strategy is used based on previous strategy. The speed is tested on CPU.
Figure 11: Comparison of text detection visualization effects between PP-OCRv2 and PP-OCRv3 text detection models. The upper and lower figures are visualizations of PP-OCRv2 and PP-OCRv3, respectively.
Figure 12: Comparison of text recognition visualization effects between PP-OCRv2 and PP-OCRv3. The left column shows test images, the middle column shows test results using PP-OCRv2, the right column shows test results using PP-OCRv3.

In Table 5, we compare the performance of the proposed PP-OCRv3 with the previous ultra lightweight PP-OCR systems. As we can see, the Hmean of PP-OCRv3 is 5.3% higher than that of PP-OCRv2 with the same inference cost on CPU. The inference speed of PP-OCRv3 is 22% faster than that of PP-OCRv2 on a T4 GPU. The visualization comparison of PP-OCRv2 and PP-OCRv3 is shown in Figure 1.

References

J. Baek, Y. Matsui, and K. Aizawa (2021) What if we only use real datasets for scene text recognition? Toward scene text recognition with fewer labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
C. Cui, T. Gao, S. Wei, Y. Du, R. Guo, S. Dong, B. Lu, Y. Zhou, X. Lv, Q. Liu, X. Hu, D. Yu, and Y. Ma (2021) PP-LCNet: a lightweight CPU convolutional neural network. arXiv:2109.15099.
Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y. Jiang (2022) SVTR: scene text recognition with a single visual model. arXiv:2205.00159.
Y. Du, C. Li, R. Guo, C. Cui, W. Liu, J. Zhou, B. Lu, Y. Yang, Q. Liu, X. Hu, et al. (2021) PP-OCRv2: bag of tricks for ultra lightweight OCR system. arXiv:2109.03144.
Y. Du, C. Li, R. Guo, X. Yin, W. Liu, J. Zhou, Y. Bai, Z. Yu, Y. Yang, Q. Dang, et al. (2020) PP-OCR: a practical ultra lightweight OCR system. arXiv:2009.09941.
M. He and Z. Yang (2018) ICPR 2018 contest on robust reading for multi-type web images (MTWI). https://tianchi.aliyun.com/competition/entrance/231651/information
W. He, X. Zhang, F. Yin, and C. Liu (2018) Multi-oriented and multi-lingual scene text detection with direct regression. IEEE Transactions on Image Processing 27(11), pp. 5406–5419.
J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin (2020) GTC: guided training of CTC towards efficient and accurate scene text recognition. In AAAI.
Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. Jawahar (2019) ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1516–1520.
D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy (2011) ICDAR 2011 robust reading competition - challenge 1: reading text in born-digital images (web and email). In 2011 International Conference on Document Analysis and Recognition, pp. 1485–1490.
M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai (2020) Real-time scene text detection with differentiable binarization. In AAAI, pp. 11474–11481.
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J. Burie, C. Liu, et al. (2019) ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition - RRC-MLT-2019. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1582–1587.
Sanster (2018) Generate text images for training deep learning OCR model. https://github.com/Sanster/text_renderer
B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(11), pp. 2298–2304.
B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai (2017) ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1429–1434.
Y. Sun, J. Liu, W. Liu, J. Han, E. Ding, and J. Liu (2019) Chinese street view text: large-scale Chinese text reading with partially supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9086–9095.
Z. Xu, W. Yang, A. Meng, N. Lu, H. Huang, C. Ying, and L. Huang (2018) Towards end-to-end license plate detection and recognition: a large dataset and baseline. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 255–271.
C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1083–1090.
X. Zhang, B. Zhu, X. Yao, Q. Sun, R. Li, and B. Yu (2022) Context-based contrastive learning for scene text recognition. In AAAI.
Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328.
Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2017) Deep mutual learning. arXiv:1706.00384.
