I have installed and set up Elasticsearch and the ingest-attachment plugin.

I need to search through a set of PDF files (20,000) located under a given file path. How would I do that?

  • How do you push the data to Elasticsearch? Is there a way to specify the file path directly in the request to Elasticsearch? (I would prefer not to use a programming language such as C# or Python.)
    Note:
    I used FSCrawler to import the PDF file contents from a local file system path into Elasticsearch.
  • How do I push the file contents into an ingest node?
  • How does the ingest-attachment plugin work?
  • I need to restrict the search results based on user access. How can I achieve that?
  • I believe that FSCrawler does that, if I understood what you asked for.
  • You need to serialize the binary to Base64 and send the Base64 string in a field of your JSON document. There's a demo in the documentation.
  • It uses Tika behind the scenes to extract the text from the document and put it into your source document.
  • You can use the security features of the Elastic Stack. You need to activate a trial license or buy a Platinum license, or use cloud.elastic.co.
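The Base64 step in the second answer can be sketched as follows. This is only an illustration of building the request body: the field name `data` is an assumption borrowed from the ingest-attachment example quoted later in this thread, `build_attachment_doc` is a hypothetical helper name, and actually sending the document to an index whose pipeline contains the attachment processor is left out.

```python
import base64
import json


def build_attachment_doc(raw_bytes: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the Base64-encoded bytes.

    `field` must match the field configured in the pipeline's attachment
    processor; "data" is an assumption taken from the documentation example.
    """
    # Base64 produces ASCII-only output, so decoding to str is safe.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


if __name__ == "__main__":
    # In practice raw_bytes would come from open(path, "rb").read().
    print(build_attachment_doc(b"hello"))
```

The resulting string is what would go into the body of the index request; the file path itself is never sent to Elasticsearch, only the encoded content.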
    Hi David,
       Thank you for your quick responses. I would be grateful if you could clarify the doubts below.
    

    C:\Users\Administrator.fscrawler\job1_settings.json
    "elasticsearch" : {
      "index" : "jobindex1",
      "index_folder" : "jobfoldersindex1",
      "pipeline" : "fscrawler",
      "nodes" : [ {
        "url" : "http://127.0.0.1:9200"
      } ],
      "bulk_size" : 100,
      "flush_interval" : "5s",
      "byte_size" : "25mb"
    }

    The above are my FSCrawler settings for Elasticsearch.

    Request to create a pipeline
    PUT _ingest/pipeline/fscrawler
    {
      "description" : "fscrawler pipeline",
      "processors" : [
        {
          "set" : {
            "field": "foo",
            "value": "bar"
          }
        }
      ]
    }

    Files are imported into elasticsearch successfully.

    Request to get the files that contain the string below:
    GET /jobindex1/_search
    {
      "query" : {
        "match": {
          "content" : "emad"
        }
      }
    }

    Result:
    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 3.15744,
        "hits" : [
          {
            "_index" : "jobindex1",
            "_type" : "_doc",
            "_id" : "7bfc1ba6cb2ea96a7cea1b84f4dbd",
            "_score" : 3.15744,
            "_source" : {
              "path" : {
                "virtual" : "/Attendance_Appraisals_07022017120507.pdf",
                "root" : "8f384e4d1aa1e6127ed1195953dccce3",
                "real" : """D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "file" : {
                "extension" : "pdf",
                "last_accessed" : "2019-01-13T06:41:24.289+0000",
                "filename" : "Attendance_Appraisals_07022017120507.pdf",
                "content_type" : "application/pdf",
                "created" : "2019-01-13T06:41:24.289+0000",
                "indexing_date" : "2019-01-13T12:04:23.798+0000",
                "filesize" : 509153,
                "last_modified" : "2017-02-07T09:05:03.872+0000",
                "url" : """file://D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "meta" : {
                "created" : "2017-02-07T02:59:52.000+0000",
                "format" : "application/pdf; version=1.3",
                "raw" : {
                  "pdf:PDFVersion" : "1.3",
                  "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
                  "xmp:CreatorTool" : "Canon ",
                  "access_permission:modify_annotations" : "true",
                  "access_permission:can_print_degraded" : "true",
                  "meta:creation-date" : "2017-02-07T06:59:52Z",
                  "created" : "2017-02-07T06:59:52Z",
                  "access_permission:extract_for_accessibility" : "true",
                  "access_permission:assemble_document" : "true",
                  "xmpTPg:NPages" : "1",
                  "Creation-Date" : "2017-02-07T06:59:52Z",
                  "resourceName" : "Attendance_Appraisals_07022017120507.pdf",
                  "dcterms:created" : "2017-02-07T06:59:52Z",
                  "dc:format" : "application/pdf; version=1.3",
                  "access_permission:extract_content" : "true",
                  "access_permission:can_print" : "true",
                  "pdf:docinfo:creator_tool" : "Canon ",
                  "access_permission:fill_in_form" : "true",
                  "pdf:encrypted" : "false",
                  "producer" : " ",
                  "access_permission:can_modify" : "true",
                  "pdf:docinfo:producer" : " ",
                  "pdf:docinfo:created" : "2017-02-07T06:59:52Z",
                  "Content-Type" : "application/pdf"
                },
                "creator_tool" : "Canon "
              },
              "foo" : "bar",
              "content" : """

    INTER OFFICE MEMO Dear All Employees ln reference to the above mentioned subject, irrespective of numerous correspondences, it has been noticed that many employees are still reporting to work late on many occasions. The grace period for morning Punch lN time is only 15 minutes from the official start timing irrespective of Head Office or Sites. Late attendance will be deducted from the monthly salary. Also PUNCH IN/OUT is mandatory. The Missed Punching will also be considered as Absent. Also note that the late Punching and related deduction will be affecting the Performance Appraisal of the employees. ln view of all the above all staff are requested to do proper attendance punching and if any technical issue please coordinate with the IT/HR department to rectify the same at the earliest will be given to any staff on the attendance punching E Janabi HR & Admin Manager Ref No. Trojan/lOM/HR & ADM/44581 17 Date: 07to2t2017 Pages 1 To All Staff TROJAN & NPC From Emad AI Janabi HR & Admin Manager CC: Engr. Hamad Al Ameri Managing Director Subject Attendance Regulations & Performance Appraisals P.o. Box 111059, Abu Dhabi, uAE. Tel. no. +9t1 2 so973oo - Fax: +gl1 2 5g2gs94 .i - oi I'l{( )f n N
    """
            }
          }
        ]
      }
    }

    Why am I not able to get a result like this one?
    {
      "found": true,
      "_index": "my_index",
      "_type": "_doc",
      "_id": "my_id",
      "_version": 1,
      "_source": {
        "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
        "attachment": {
          "content_type": "application/rtf",
          "language": "ro",
          "content": "Lorem ipsum dolor sit amet",
          "content_length": 28
        }
      }
    }

    If you are using FSCrawler, you don't need the ingest-attachment plugin at all, as everything is done by FSCrawler.

    What is the output you need? You don't want to index the content?
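To illustrate why the two results in this thread look different: in the FSCrawler document the extracted text sits in a top-level `content` field, while in the ingest-attachment example it sits under `attachment.content`. The sketch below reads the text from either layout. The helper name `extracted_text` is hypothetical, and it assumes the attachment processor writes to its usual `attachment` target field, as in the example quoted above.

```python
from typing import Optional


def extracted_text(source: dict) -> Optional[str]:
    """Return the extracted text from a hit's `_source`, whichever layout it uses."""
    # FSCrawler layout: the text is stored in the top-level "content" field.
    if "content" in source:
        return source["content"]
    # ingest-attachment layout (assumed target field "attachment"):
    # the text is stored in "attachment.content".
    attachment = source.get("attachment")
    if isinstance(attachment, dict):
        return attachment.get("content")
    return None


if __name__ == "__main__":
    fscrawler_source = {"foo": "bar", "content": "INTER OFFICE MEMO"}
    ingest_source = {"data": "...", "attachment": {"content": "Lorem ipsum dolor sit amet"}}
    print(extracted_text(fscrawler_source))
    print(extracted_text(ingest_source))
```

Either way the searchable text is indexed; only the field you query changes (`content` for FSCrawler, `attachment.content` for the ingest-attachment layout).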

    It depends.

    If you need to crawl a filesystem, then FSCrawler is a good fit. If you just need to index a single binary file that you have somewhere, then ingest attachment is probably OK.
    But it doesn't expose all Tika features, such as OCR. If you need those, FSCrawler would be preferred.

    Disclaimer: I'm the author of FSCrawler so I might be biased. :wink:
