
How can I index PDF and image documents into Elasticsearch? I would like to extract entities so that the documents can be searched by keyword. Does Workplace Search provide this functionality? Is Apache Tika used within Elasticsearch, or is this accomplished by the NLP modules?

Primarily, I would like to index a few thousand PDF/image documents from:

  • Local file system (Windows/Linux)
  • AWS S3 bucket

    You can use the ingest attachment plugin.

    There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

    PUT _ingest/pipeline/attachment
    {
      "description" : "Extract attachment information",
      "processors" : [
        {
          "attachment" : {
            "field" : "data"
          }
        }
      ]
    }
    PUT my_index/_doc/my_id?pipeline=attachment
    {
      "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
    }
    GET my_index/_doc/my_id
    

    The data field is basically the BASE64 representation of your binary file.
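As a rough illustration of how that `data` value could be produced, here is a minimal Python sketch using only the standard library. The helper name `build_attachment_doc` is hypothetical and the field name `data` is assumed to match the pipeline's `"field"` setting; sending the resulting JSON to Elasticsearch is left out.

```python
import base64
import json
from pathlib import Path


def build_attachment_doc(path, field="data"):
    """Return a JSON request body whose `field` holds the file as Base64.

    `build_attachment_doc` is a hypothetical helper for illustration;
    `field` is assumed to match the "field" configured in the pipeline.
    """
    raw = Path(path).read_bytes()                    # read the binary file (PDF, image, ...)
    encoded = base64.b64encode(raw).decode("ascii")  # Base64 text, safe to embed in JSON
    return json.dumps({field: encoded})
```

The returned string could then be used as the body of a request such as `PUT my_index/_doc/my_id?pipeline=attachment`.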

    Alternatively, you can use FSCrawler. There's a tutorial to help you get started.

    Thanks David for your quick response.

    I have looked at both of these options. FSCrawler looks like the best one. It can also feed Workplace Search, which gives us a nice search UI along with facets.
    If I want to use Workplace Search with OneDrive or SharePoint Online as the source, can the same functionality be achieved?

    If we want to use the ingest attachment plugin, how can we feed the documents (PDF/image) in bulk?

    I am also looking into whether the NLP ML models used in Elasticsearch can help tag these documents with relevant tags. That way, searches could be run against the exact values of the identified tags.