Deleting Entries in Elasticsearch Based On Timestamp

It’s inevitable after ingesting lots of server logs into Elasticsearch, there’s a requirement to delete partial logs, either they were incorrect data or loaded more than once.  When there are millions of data, it’s just inefficient to drop all of the index and start over from the beginning.  Luckily, there’s a solution by using Elasticsearch range by query API:

POST apachelogs-2018.11.02/_delete_by_query?wait_for_completion=false
{
   "query": {
      "range": {
          "@timestamp": { 
               "gte" : "02/11/2018",
               "lte" : "02/11/2018",
               "time_zone": "-07:00",
               "format": "dd/MM/yyyy||yyyy"
          }
      }
   }
}

The directive ?wait_for_completion=false is for use in Kibana dev tools since the GUI will give a gateway timeout if the task takes more than 30 seconds.  Instead, the option will send the task into the background and not wait for it to complete in Kibana UI.

Another important note, the logs are stored in UTC time zone, by default.  Elastic Support and Training staff have confirmed this. Deleting without specifying a timezone will look like partial deletion.  This same problem happens when dropping just one particular day (ie. apachelogs-2018.11.12) index since the entries will overlap with the next day’s index.  Thus, in this case, since it’s a requirement to delete the entire Nov 2 timestamped data, a specific time zone (Pacific Daylight Time) “-07:00” is necessary.

The data will then look like this in Kibana’s Discovery tool:

Deleting An Entire Day Out Of Elasticsearch

Leave a Reply

Your email address will not be published.