Using Solr and TikaOCR to search text inside an image

Tesseract is probably the most accurate open-source OCR engine available, and as of Apache Tika 1.7 you can use the awesome Tesseract OCR parser within Tika!

Solr 5.x supports Tika 1.7. I wanted to try this in Solr 5.2, so I configured it on my machine. Below are the steps required to make TikaOCR work with Solr 5.2:

Tika OCR works out of the box in Solr. We just need to install Tesseract and make sure its executable can be found on the system path, since Tika calls it as an external program. You can download Tesseract from here. It is a native tool and it does the actual OCR work. I used version 3.02.02.
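Before starting Solr, it can help to confirm that the Tesseract binary is actually visible to the process that will call it. The snippet below is a minimal sketch, assuming the executable is named `tesseract` and that `tesseract -v` prints version information (this holds for the 3.x releases I am aware of, but treat it as an assumption for your install).

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(executable)


def tesseract_version(path):
    """Run `<path> -v` and return the first line it prints.

    Some Tesseract versions write the version to stderr, so both
    streams are merged before reading.
    """
    result = subprocess.run(
        [path, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH")
    else:
        print(path, "->", tesseract_version(path))
```

If the script reports that the binary is missing, fix the path for the user that runs Solr, not just for your own shell.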

The next step is to enable the extracting request handler in solrconfig.xml.

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>

We will use the image below for OCR.

OM_1

Now, from the Solr admin UI, we can upload the image file to the /update/extract request handler with the following parameters: literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true

A sample URL is:

http://{$solr_host}:{$solr_port}/solr/techproducts/update/extract?wt=json&literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true
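The same request can be sent from a script instead of the admin UI. The following is a rough sketch using only the Python standard library. The host `localhost`, the port `8983`, and the file name `OM_1.jpg` in the usage note are assumptions for illustration. It posts the raw image bytes as the request body with a matching Content-Type header, which is one way the extract handler accepts a file.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(host, port, core, doc_id):
    """Build the /update/extract URL with the parameters used in this post."""
    params = {
        "wt": "json",
        "literal.id": doc_id,            # id assigned to the indexed document
        "uprefix": "attr_",              # prefix for fields not in the schema
        "fmap.content": "attr_content",  # map extracted text to attr_content
        "commit": "true",
    }
    return "http://{}:{}/solr/{}/update/extract?{}".format(
        host, port, core, urlencode(params)
    )


def post_image(url, image_path, content_type="image/jpeg"):
    """POST the raw image bytes to Solr and return the response body."""
    with open(image_path, "rb") as handle:
        body = handle.read()
    request = Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `post_image(build_extract_url("localhost", 8983, "techproducts", "d1"), "OM_1.jpg")` would index the image under id d1, assuming Solr is running locally with the techproducts core.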

See the image below for more information.

SolrImageExtract

Click Submit. If everything goes well, the image will be indexed with id "d1". We can then query Solr for this id to see what Tika extracted from the image. It extracts many metadata fields from the uploaded image, as the response below shows.

{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"id:d1",
"indent":"true",
"fl":"*",
"wt":"json"
}
},
"response":{
"numFound":1,
"start":0,
"docs":[
{
"id":"d1",
"attr_stream_size":[
"55422"
],
"attr_x_parsed_by":[
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.ocr.TesseractOCRParser",
"org.apache.tika.parser.jpeg.JpegParser"
],
"attr_stream_content_type":[
"image/jpeg"
],
"attr_resolution_units":[
"inch"
],
"attr_stream_source_info":[
"the-file"
],
"attr_compression_type":[
"Progressive, Huffman"
],
"attr_data_precision":[
"8 bits"
],
"attr_number_of_components":[
"3"
],
"attr_tiff_imagelength":[
"286"
],
"attr_component_2":[
"Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"
],
"attr_component_1":[
"Y component: Quantization table 0, Sampling factors 2 horiz/2 vert"
],
"attr_image_height":[
"286 pixels"
],
"attr_x_resolution":[
"72 dots"
],
"attr_image_width":[
"690 pixels"
],
"attr_stream_name":[
"OM_1.jpg"
],
"attr_component_3":[
"Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"
],
"attr_tiff_bitspersample":[
"8"
],
"attr_tiff_imagewidth":[
"690"
],
"content_type":[
"image/jpeg"
],
"attr_y_resolution":[
"72 dots"
],
"attr_content":[
" \n \n \n \n \n \n \n \n \n \n \n ' '\"I" \" \"' ./\nlrast. Shortly before the classes started I was visiting a.\ncertain public school, a school set in a typically English\ncountryside, which on the June clay of my visit was wonder-\nfully beauliful. The Head Master—-no less typical than his\nschool and the country-side—pointed out the charms of\nboth, and his pride came out in the final remark which he made\nbeforehe left me. He explained that he had a class to take\nin'I'heocritus. Then (with a. buoyant gesture); " Can you\n\n, conceive anything more delightful than a class in Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n "
],
"_version_":1506952201156689920
}
]
}
}
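A response like this one can also be fetched and unpacked in code. The sketch below assumes the select handler is reachable under a base URL such as `http://localhost:8983/solr/techproducts` (an assumption, not something stated in this post). The helper names `fetch_document` and `extract_ocr_text` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_ocr_text(response, field="attr_content"):
    """Return the first value of `field` from the first matching document.

    Returns None when there is no document or the field is absent.
    """
    docs = response.get("response", {}).get("docs", [])
    if not docs:
        return None
    values = docs[0].get(field, [])
    return values[0] if values else None


def fetch_document(base_url, doc_id):
    """Query Solr for a single id and return the parsed JSON response."""
    query = urlencode({"q": "id:{}".format(doc_id), "fl": "*", "wt": "json"})
    with urlopen("{}/select?{}".format(base_url, query)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_ocr_text(fetch_document(base_url, "d1"))` would then return the raw OCR string, including the surrounding whitespace Tesseract produced.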

In the response above, "attr_content" is the field where the OCR output is stored:

"attr_content":[
" \n \n \n \n \n \n \n \n \n \n \n ' '\"I" \" \"' ./\nlrast. Shortly before the classes started I was visiting a.\ncertain public school, a school set in a typically English\ncountryside, which on the June clay of my visit was wonder-\nfully beauliful. The Head Master—-no less typical than his\nschool and the country-side—pointed out the charms of\nboth, and his pride came out in the final remark which he made\nbeforehe left me. He explained that he had a class to take\nin'I'heocritus. Then (with a. buoyant gesture); " Can you\n\n, conceive anything more delightful than a class in Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n "

These results are quite accurate, although a few character-level errors remain (for example, "clay" for "day" and "beauliful" for "beautiful"). If the results are not accurate enough, we can apply a context-based or linguistic error-correction algorithm to clean up the text.
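As a very small illustration of post-OCR cleanup, the sketch below collapses the runs of whitespace seen in the output and replaces out-of-vocabulary words with the closest entry from a word list, using `difflib` from the Python standard library. The `VOCABULARY` list is a tiny hypothetical stand-in for a real dictionary, and this is plain dictionary matching, not a context-based method.

```python
import difflib
import re

# Hypothetical word list for illustration; a real one would be far larger.
VOCABULARY = ["beautiful", "before", "classes", "countryside", "wonderfully"]


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the word itself if none is close.

    Matching is done on the lowercased word, so a replacement comes back
    in lowercase.
    """
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    print(normalize_whitespace(" \n \n fully beauliful. \n "))
    print(correct_word("beauliful", VOCABULARY))
```

A misread such as "beauliful" is close enough to "beautiful" to be repaired this way. An error like "clay" for "day" yields a valid English word, so catching it would need the surrounding context, which is where a language-model or n-gram based corrector comes in.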

1 Comment

Brian Hagan

September 17, 2015 at 4:54 am

Thanks Vijay, I used this post successfully. The tricky part for me was building leptonica on Centos due to pathing and the required dependencies.
