The exponential growth of unstructured data is a reality, and enterprises face a growing challenge in turning this seemingly insurmountable volume of data into insights that drive business growth.
According to a study by the International Data Group (IDG), unstructured data is growing at an alarming rate of 62% per year. The same study suggests that by 2022, close to 93% of all data in the digital world will be unstructured.
The biggest challenge enterprises face when using a keyword-based search platform is that a major portion of organizational data is unstructured. According to the Market Pulse Survey by SailPoint, over 71% of enterprises across the globe are not sure how to manage unstructured data, likely because a keyword-based search tool has difficulty interpreting it.
Given the exponential growth in medical literature, finding relevant information quickly is critical. Researchers, with more content and less time to analyse it, need systems intelligent enough to integrate scattered content, speed up the discovery of information and provide tools for thorough analysis.
We know that computers understand programming languages, but how about making them understand human language, the language that you and I speak? Natural Language Processing (NLP) is the field of study that makes this possible. It focuses on enabling computers to analyse, understand and derive meaning from human language in order to perform a wide range of tasks.
When enterprise search is built from scratch, evaluating search quality remains a key challenge for the organizations implementing it. It can feel like working in the dark all the time. Such implementations demand enormous effort and time. The chart below shows a typical situation in which organizations invest and work consistently on improving search quality over time, and yet remain far from satisfied.
This article explains how to implement Solr "document level security" using the Manifold Connector Framework (ManifoldCF). ManifoldCF is an open source framework for pulling content out of a repository and sending it on to targets such as Solr through a plug-in-style, connector-based architecture.
According to Docker's website, "Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications."
In simple words, it's one way to run and deploy your software application. Docker allows you to create lightweight "virtual machines", which are nothing but Docker containers.
Tesseract is probably the most accurate open source OCR engine available, and with Apache Tika 1.7 you can now use the awesome Tesseract OCR parser within Tika!
Solr 5.x has support for Tika 1.7 (see this). I wanted to try this in Solr 5.2, so I configured it on my machine. Below are the steps required to make TikaOCR work with Solr 5.2.
One of the lesser-known but cool features of ReplicationHandler is its support for index backup. You have probably used ReplicationHandler in your project to replicate the index from master to slave instances. If you want to take a backup of the index, you can do it as follows:
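A minimal sketch of building such a backup request, assuming a Solr instance at the hypothetical address `http://localhost:8983/solr` and a hypothetical core named `mycore`. ReplicationHandler accepts `command=backup` on its `/replication` endpoint, with an optional `location` parameter for the target directory. The helper name `backup_url` is made up for illustration.

```python
from urllib.parse import urlencode


def backup_url(base_url, core, location=None):
    """Build a ReplicationHandler backup URL (hypothetical helper).

    base_url and core are assumptions about your deployment;
    location is an optional directory for the backup snapshot.
    """
    params = {"command": "backup"}
    if location:
        params["location"] = location
    return f"{base_url.rstrip('/')}/{core}/replication?{urlencode(params)}"


# Hypothetical host and core name; adjust to your own setup.
url = backup_url("http://localhost:8983/solr", "mycore")
print(url)

# To actually trigger the backup, send a GET request to this URL, e.g.
# urllib.request.urlopen(url). Not executed here because it needs a live Solr.
```

The request only starts the backup, so the sketch stops at building the URL and leaves sending it to you.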
An ontology formally defines a common set of terms used to describe and represent a domain. It is domain-specific: it describes and represents one area of knowledge. It contains terms and the relationships among those terms. A further level of relationship is expressed through a special group of terms called properties.
If you have multiple clients updating documents, it's critical to ensure that a newer version of a document is never overwritten by an older one. To address this problem, you need concurrency control, which is the process of managing simultaneous updates to documents.
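One common way to get this guarantee is optimistic concurrency control: every document carries a version number, and an update is accepted only if the client's expected version matches the stored one. The sketch below illustrates the idea with an in-memory store. It is not Solr's API, and `DocumentStore` and `VersionConflict` are hypothetical names chosen for the example.

```python
class VersionConflict(Exception):
    """Raised when an update is based on a stale version of a document."""


class DocumentStore:
    """Hypothetical in-memory store that checks versions on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, fields)

    def get(self, doc_id):
        """Return (version, fields) for a stored document."""
        return self._docs[doc_id]

    def put(self, doc_id, fields, expected_version=0):
        """Write a document only if expected_version matches the stored version.

        A document that does not exist yet has version 0.
        Returns the new version number.
        """
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version != current_version:
            raise VersionConflict(
                f"{doc_id}: expected version {expected_version}, "
                f"found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(fields))
        return new_version


store = DocumentStore()
v1 = store.put("doc-1", {"title": "draft"})  # create the document

# Two clients both read version v1; client A writes first and succeeds.
store.put("doc-1", {"title": "edited by A"}, expected_version=v1)

# Client B still holds v1, so its write is rejected instead of
# silently overwriting A's newer version.
try:
    store.put("doc-1", {"title": "edited by B"}, expected_version=v1)
except VersionConflict as err:
    print("rejected:", err)
```

A rejected client would normally re-read the document, reapply its change on top of the latest version, and retry.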