sql - NoSQL for searching millions of pages?


I have been given approximately 4-5 million images of old documents that my company has decided to dispose of. We are trying to go paperless, but I am facing a problem that I do not fully understand. I have always used SQL for this amount of data, but now I have only images. I have already bought ABBYY FineReader OCR, and it is currently running OCR on all the files and producing Word or PDF output. The problem is that they want to search this data in less than 7-10 seconds at this scale, and get every result with a download link to the original image of the file.

I have nothing against SQL, but I think it is not the best fit here, because I do not need a table with a real schema: just the full text of each image, the page number, and a link to the original file. As far as I know, searching this with SQL will take ages. What other solution can I use?

To search a set of documents, the best solution is usually to build an inverted index. I assume you want to support the kind of searches that Google, Bing, etc. provide, but on your own data.

Building an inverted index usually involves splitting the documents into words and storing each word as a separate entry in the index, with the word as the key and, as the value, the names (or other identifiers) of the documents that contain the word.
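To make the idea concrete, here is a minimal in-memory sketch of an inverted index in Python. It is only an illustration, not how a real search engine stores its data: the file names and page texts are made up, and the functions `build_inverted_index` and `search` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the set of page identifiers that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the identifiers of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output: page identifier -> recognized text.
pages = {
    "scans/invoice_001.tif": "invoice for office paper",
    "scans/contract_017.tif": "contract for paper supply",
    "scans/memo_203.tif": "memo about office move",
}

index = build_inverted_index(pages)
matches = sorted(search(index, "office paper"))
print(matches)  # ['scans/invoice_001.tif']
```

A lookup only touches the entries for the query words, which is why this structure stays fast even when the collection is large.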

You can do this manually, but it is not trivial to parse the documents, extract the words, eliminate insignificant words (stop words), and index them. Using a dedicated product is easier.
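As a rough idea of what the manual route involves, the parsing step alone might look like the sketch below. The stop-word list is a tiny made-up sample and `tokenize` is a hypothetical helper; dedicated products also handle stemming, other languages, and OCR noise, which this does not attempt.

```python
import re

# A tiny illustrative stop-word list; real engines ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}


def tokenize(text):
    """Lowercase the text, extract ASCII alphanumeric words, drop stop words."""
    # Note: this pattern ignores accented and non-Latin letters.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


tokens = tokenize("The Contract for the Supply of Paper, signed in 1987.")
print(tokens)  # ['contract', 'supply', 'paper', 'signed', '1987']
```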

Most RDBMSs also support extensions that provide full-text indexing.

  • Generally, these RDBMS extensions are less efficient than dedicated search engines.

  • I think any of these products should be able to index a few million documents.
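For a feel of what a database full-text extension looks like in practice, here is a small sketch using SQLite's FTS5 module through Python's built-in `sqlite3`. SQLite is my choice for illustration only, since it needs no server; it is not one of the products discussed above, and the example assumes your SQLite build has FTS5 enabled. The table name, column names, and URLs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 virtual table: body is indexed for search, while page_no and
# image_url are stored alongside it but excluded from the index.
conn.execute(
    "CREATE VIRTUAL TABLE pages "
    "USING fts5(body, page_no UNINDEXED, image_url UNINDEXED)"
)

# Hypothetical OCR results: text, page number, link to the original image.
conn.executemany(
    "INSERT INTO pages (body, page_no, image_url) VALUES (?, ?, ?)",
    [
        ("invoice for office paper", 1, "https://example.com/scans/invoice_001.tif"),
        ("contract for paper supply", 3, "https://example.com/scans/contract_017.tif"),
        ("memo about office move", 2, "https://example.com/scans/memo_203.tif"),
    ],
)

# Terms separated by spaces are combined with an implicit AND.
rows = conn.execute(
    "SELECT image_url, page_no FROM pages WHERE pages MATCH ? ORDER BY rank",
    ("office paper",),
).fetchall()

for image_url, page_no in rows:
    print(image_url, page_no)
```

Each result row carries the link to the original image and its page number, which is the shape of answer the question asks for.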

