Logstash
Revision as of 19:24, 20 October 2021
Logstash is a tool for managing events and logs. When used generically, the term encompasses a larger system of log collection, processing, storage and searching activities.
Overview
File:ELK Tech Talk 2015-08-20.pdf
File:Using Kibana4 to read logs at Wikimedia Tech Talk 2016-11-14.pdf
Various Wikimedia applications send log events to Logstash, which gathers the messages, converts them into JSON documents, and stores them in an Elasticsearch cluster. Wikimedia uses Kibana as a front-end client to filter and display messages from the Elasticsearch cluster. These are the core components of our ELK stack, but since we use additional components as well, we refer to our stack as "ELK+".
Elasticsearch
Elasticsearch is a multi-node Lucene implementation. The same technology powers CirrusSearch on WMF wikis.
Logstash
Logstash is a tool to collect, process, and forward events and log messages. Collection is accomplished via configurable input plugins, including raw socket/packet communication, file tailing, and several message bus clients. Once an input plugin has collected data, it can be processed by any number of filters which modify and annotate the event data. Finally, Logstash routes events to output plugins, which can forward them to a variety of external programs including Elasticsearch, local files, and several message bus implementations.
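The three stages can be sketched as a minimal pipeline configuration. This is an illustrative sketch only, not our production config: the topic, broker hostname, and field names are made up.

```
input {
  kafka {
    # hypothetical topic and broker names
    topics => ["logging-udp"]
    bootstrap_servers => "kafka-logging1001.example:9092"
    codec => "json"
  }
}
filter {
  # annotate every event with an illustrative tag
  mutate { add_field => { "pipeline" => "example" } }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```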
Kibana
Kibana is a browser-based analytics and search interface for Elasticsearch that was developed primarily to view Logstash event data.
Kafka
Apache Kafka is a distributed streaming system. In our ELK stack, Kafka buffers the stream of log messages produced by rsyslog (on behalf of applications) for consumption by Logstash. Nothing should output logs to Logstash directly; logs should always be sent by way of Kafka.
Rsyslog
Rsyslog is the "rocket-fast system for log processing". In our ELK stack rsyslog is used as the host "log agent". Rsyslog ingests log messages in various formats and from varying protocols, normalizes them and outputs to Kafka.
Elastalert
Elastalert is a utility to query log message data in order to generate alerts. Many conditions, thresholds and output mechanisms are supported.
Systems feeding into logstash
See 2015-08 Tech talk slides
Writing new filters is easy.
Supported log shipping protocols & formats ("interfaces")
Support for logs shipped directly from applications to Logstash has been deprecated.
Please see Logstash/Interface for details regarding long-term supported log shipping interfaces.
Kubernetes
Logs from Kubernetes-hosted services are handled directly by the Kubernetes infrastructure, which ships them via rsyslog into the Logstash pipeline. All a Kubernetes service needs to do is log in a JSON structured format (e.g. bunyan for Node.js services) to standard output/standard error.
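For a service not using a logging library like bunyan, emitting one JSON object per line on stdout is enough. A minimal Python sketch; the field names here are illustrative, not a required schema:

```python
import json
import sys
import time

def log(level, message, **fields):
    """Emit one JSON object per line to stdout; the logging infrastructure picks it up."""
    record = {"time": int(time.time()), "level": level, "message": message}
    record.update(fields)
    sys.stdout.write(json.dumps(record) + "\n")

log("INFO", "request served", status=200)
```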
Systems not feeding into logstash
- EventLogging (of program-defined events with schemas), despite its name, uses a different pipeline.
- Varnish logs of the billions of pageviews of WMF wikis would require a lot more hardware. Instead we use Kafka to feed web requests into Hadoop. A notable exception to this rule: varnish user-facing errors (HTTP status 500-599) are sent to logstash to make debugging easier.
- MediaWiki logs usually go to both Logstash and log files, but a few log channels are excluded. You can check which in $wmgMonologChannels in InitialiseSettings.php.
Writing & testing filters
When writing new Logstash filters, take a look at what already exists in puppet. Each filter must be tested to avoid regressions; we use logstash-filter-verifier, and existing tests can be found in the tests/ directory. To write new tests or run existing ones you will need logstash-filter-verifier and Logstash installed locally.
Each filter has a corresponding test named after it in tests/. Within the test file, the fields map lists the fields common to all tests; these are used to trigger a specific filter's "if" conditions. The ignore key usually contains only @timestamp, since that field is bound to change across invocations and can be safely ignored. The remainder of a test file is a list of testcases in the form of input/expected pairs. For "input" it is recommended to use a YAML block scalar (>) to include verbatim JSON, whereas "expected" is usually plain YAML, although it can also be verbatim JSON if more convenient.
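Following that layout, a test file might look like this sketch. The program name, fields, and the filter behavior being tested are all hypothetical:

```yaml
fields:
  # common to all testcases; triggers the filter's "if" conditions
  program: "myapp"
ignore:
  - "@timestamp"
testcases:
  - input:
      - >
        {"message": "user login failed", "severity": "err"}
    expected:
      - message: "user login failed"
        level: "ERROR"
```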
Production Logstash Architecture
As of FY2019 Logstash infrastructure is owned by SRE. See also Logstash/SRE_onboard for more information on how to migrate services/applications.
Architecture Diagram
Web interface
- logstash.wikimedia.org runs Kibana
Authentication
- wikitech LDAP username and password and membership in one of the following LDAP groups: nda, ops, wmf
Configuration
- The cluster contains two types of nodes, configured by puppet.
- role::logstash manages the Logstash "collector" instances. These run logstash, a no-data Elasticsearch node, and an Apache vhost serving the Kibana application. The Apache vhosts also act as reverse proxies to the Elasticsearch cluster and perform LDAP-based authentication to restrict access to the potentially sensitive log information.
- role::logstash::elasticsearch manages the Elasticsearch data nodes, and the Kafka-logging brokers. These provide the inbound message buffering, and long-term storage layer for log data.
Hostname Convention
Current
- logstash1NNN Logstash related servers in Eqiad.
- logstash2NNN Logstash related servers in Codfw.
Future
- logstashNNNN - Logstash "collector" hosts
- elasticsearch-loggingNNNN - Logstash Elasticsearch hosts
- kafka-loggingNNNN - Logstash Kafka broker hosts
Operating Systems
All hosts run Debian Stretch as a base operating system
Load Balancing and TLS
The misc Varnish cluster provides TLS termination and load balancing for the Kibana application.
Kibana quick intro
- Start from one of the blue Dashboard links near the top, more are available from the Load icon near the top right.
- In "Events over time" click to zoom out to see what you want, or select a region with the mouse to zoom in.
- smaller time intervals are faster
be careful: you may see no events at all... because you're viewing the future
- When you get lost, click the Home icon near the top right
- As an example query, wfDebugLog( 'Flow', ...) in MediaWiki PHP corresponds to type:mediawiki AND channel:flow
- Switch to using mw:Structured logging and you can query for ... AND level:ERROR
Read slide 11 and onwards in the Tech Talk on ELK by Bryan Davis; it highlights features of the Kibana web page.
Common Logging Schema
See: Logstash/Common Logging Schema.
API
The Elasticsearch API is accessible at https://logstash.wikimedia.org/elasticsearch/
Note: The _search endpoint can only be used without a request body (see task T174960). Use _msearch instead for complex queries that need a request body.
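A request body for _msearch is newline-delimited JSON: a header line naming the target indices, followed by a body line with the query DSL. A sketch of building such a body; the index pattern and query are illustrative:

```python
import json

def msearch_body(queries, index="logstash-*"):
    """Build an NDJSON _msearch body: a header line plus a body line per query."""
    lines = []
    for q in queries:
        lines.append(json.dumps({"index": index}))  # header: which indices to search
        lines.append(json.dumps(q))                 # body: the query DSL
    return "\n".join(lines) + "\n"                  # _msearch requires a trailing newline

body = msearch_body([{"query": {"query_string": {"query": "level:ERROR"}}, "size": 10}])
# POST this body to https://logstash.wikimedia.org/elasticsearch/_msearch
# with Content-Type: application/x-ndjson and your LDAP credentials.
```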
Extract data from Logstash with Python
To get the last 100 log entries matching the Lucene query logger_name:varnishfetcherr AND layer:backend
#!/usr/bin/env python3
import os
import sys
import json
import requests

# Lucene query and number of results to fetch
query = "logger_name:varnishfetcherr AND layer:backend"
results = 100

# LDAP credentials are read from the environment
ldap_user = os.getenv("LDAP_USER")
ldap_pass = os.getenv("LDAP_PASS")
if ldap_user is None or ldap_pass is None:
    print("You need to set LDAP_USER and LDAP_PASS")
    sys.exit(1)

url = "https://logstash.wikimedia.org/elasticsearch/_search?size={}&q={}".format(
    results, query
)
resp = requests.get(url, auth=requests.auth.HTTPBasicAuth(ldap_user, ldap_pass))
if resp.status_code != 200:
    print("Something's wrong, response status code={}".format(resp.status_code))
    sys.exit(1)

# Print one JSON document per matching log entry
data = resp.json()
for line in data["hits"]["hits"]:
    print(json.dumps(line["_source"]))
Note: Certain queries with whitespace characters may require additional URL-encoding (via urllib.parse.quote or similar) when using Python requests. If requests to the Logstash API consistently return 504 HTTP status codes, even for relatively lightweight queries, this may be the issue.
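For instance, a query containing spaces and colons can be encoded with urllib.parse.quote before being interpolated into the URL:

```python
from urllib.parse import quote

query = "logger_name:varnishfetcherr AND layer:backend"
encoded = quote(query)  # spaces become %20, colons become %3A
url = "https://logstash.wikimedia.org/elasticsearch/_search?size=100&q=" + encoded
print(encoded)
```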
Extract data from Logstash (elasticsearch) with curl and jq
logstash-server:~$ cat search.sh
curl -XGET 'localhost:9200/_search?pretty&size=10000' -H 'Content-Type: application/json' -d '
{
"query": {
"query_string" : {
"query" : "facility:19,local3 AND host:csw2-esams AND @timestamp:[2019-08-04T03:00 TO 2019-08-04T03:15] NOT program:mgd"
}
},
"sort": ["@timestamp"]
} '
logstash-server:~$ bash search.sh | jq '.hits.hits[]._source | {timestamp,host,level,message}' | head -20
{
"timestamp": "2019-08-04T03:00:00+00:00",
"host": "csw2-esams",
"level": "INFO",
"message": " %-: (root) CMD (newsyslog)"
}
{
"timestamp": "2019-08-04T03:00:00+00:00",
"host": "csw2-esams",
"level": "INFO",
"message": " %-: (root) CMD ( /usr/libexec/atrun)"
}
{
"timestamp": "2019-08-04T03:01:00+00:00",
"host": "csw2-esams",
"level": "INFO",
"message": " %-: (root) CMD (adjkerntz -a)"
}
$ bash search.sh | jq -r '.hits.hits[]._source | {timestamp,host,level,program,message} | map(.) | @csv' > asw2-d2-eqiad-crash.csv
Plugins
Logstash plugins are fetched and compiled into a Debian package for distribution and installation on Logstash servers.
The plugin git repository is located at https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/logstash/plugins
Plugin build process
The build can be run on the production builder host. See README for updated build steps.
Deployment
- Add package to reprepro and install on the host normally.
Note: Package installation will not restart Logstash. This must be done manually in a rolling fashion, and it's strongly suggested to perform this in step with the plugin deploy.
Beta Cluster Logstash
- Web interface
- logstash-beta.wmflabs.org
- NEW: beta-logs.wmcloud.org
- Access control
- Unlike production Logstash, Beta Cluster may not use LDAP for authentication. Credentials for Beta's Logstash can be found on deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud in /root/secrets.txt.
Gotchas
GELF transport
Make sure logging events sent to the GELF input don't have a "type" or "_type" field set, or if set, that it contains the value "gelf". The gelf/logstash config discards any events that have a different value set for "type" or "_type". The final "type" seen in Kibana/Elasticsearch will be taken from the "facility" element of the original GELF packet. The application sending the log data to Logstash should set "facility" to a reasonably unique value that identifies your application.
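As an illustration, a well-formed payload might look like the following sketch. The host, message, and facility values are made up; the relevant points are the absence of "type"/"_type" and a unique "facility":

```python
import json

# Hypothetical GELF payload: no "type"/"_type" field, and a unique "facility"
gelf_message = {
    "version": "1.1",
    "host": "appserver01",
    "short_message": "cache miss for key user:42",
    "level": 6,                 # syslog severity: informational
    "facility": "myapp-cache",  # becomes the "type" seen in Kibana/Elasticsearch
}
payload = json.dumps(gelf_message)
```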
Documents
Troubleshooting
Kafka consumer lag
For a host of reasons it might happen that there's a buildup of messages on Kafka. For example:
- Elasticsearch is refusing to index messages, so Logstash can't consume properly from Kafka.
- The reason for index failure is usually conflicting fields; see bug T150106 for a detailed discussion of the problem. The solution is to find which programs are generating the conflicts and drop their logs in Logstash accordingly; see also bug T228089.
Using the dead letter queue
The Logstash dead letter queue (DLQ) is not normally enabled; however, it comes in handy when debugging indexing failures where the problematic log entries don't show up in the Logstash logs.
Enable the DLQ (with puppet disabled) in /etc/logstash/logstash.yml:
dead_letter_queue.enable: true
path.dead_letter_queue: "/var/lib/logstash/dead_letter_queue/"
Then systemctl restart logstash. The DLQ will start filling up as soon as unindexable logs are received. At a later time the DLQ can be dumped with (running as the logstash user):
$ /usr/share/logstash/bin/logstash -e '
    input {
      dead_letter_queue {
        path => "/var/lib/logstash/dead_letter_queue/"
        commit_offsets => false
        pipeline_id => "main"
      }
    }
    output {
      stdout { codec => rubydebug { metadata => true } }
    }' 2>&1 | less
Once debugging is complete, clear the queue with rm /var/lib/logstash/dead_letter_queue/main/*.log and re-enable puppet.
Operations
Configuration changes
Puppet will restart Logstash as needed upon configuration changes (supporting reload is a TODO). After merging your configuration change, it is usually enough to run Cumin in batches of one with at least 60s of sleep:
cumin -b1 -s60 'O:logstash or O:logstash7' 'run-puppet-agent -q'
Test a configuration snippet before merge
Copy your ready-to-merge snippet (e.g. modules/profile/files/logstash/filter-syslog-network.conf) to a Logstash host. Then run
sudo /usr/share/logstash/bin/logstash --config.test_and_exit -f <myfile>
It should return "Configuration OK".
Indexing errors
Have a look at the Dead Letter Queue Dashboard. The original message that caused the error is in the log.original field.
We're alerting on errors that Logstash gets from Elasticsearch whenever there's an "indexing conflict" between fields of the same index (see also bug T236343). This usually happens because two applications send logs with the same field name but two different types: e.g. response will be sent as a string in one case but as a nested object in another. Bug T239458 is a good example of this, where different parts of MediaWiki send logs formatted in different ways.
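A minimal illustration of such a conflict, with made-up program names:

```python
import json

# Producer A logs "response" as a string...
doc_a = json.loads('{"program": "app-a", "response": "200 OK"}')
# ...while producer B logs "response" as a nested object.
doc_b = json.loads('{"program": "app-b", "response": {"status": 200, "body_bytes": 512}}')

# Elasticsearch derives a field's mapping from the first document it indexes;
# whichever of these two arrives second will fail to index with a mapping conflict.
print(type(doc_a["response"]).__name__, type(doc_b["response"]).__name__)
```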
No logs indexed
This alert is based on the incoming logs per second indexed by Elasticsearch. During normal operation there is a baseline of ~1k logs/s (July 2020), and anything significantly lower than that is an unexpected condition. Check the Logstash dashboard attached to the alert for signs of root causes. Most likely Logstash has stopped sending logs to Elasticsearch.
Drop spammy logs
Occasionally producers will outpace Logstash's ingestion capabilities, most often with what's considered "log spam" (e.g. dumping whole requests/responses in debug logs). In these cases one solution is to drop the offending logs in Logstash, ideally once the producer has already stopped spamming. The simplest such filter is installed before most/all other filters; it matches a few fields and then drops the message:
filter {
  if [program] == "producer" and [nested][field] == "offending value" {
    drop {}
  }
}
See also this Gerrit change for a real-world example.
UDP packet loss
Logstash 5 locks up from time to time, causing UDP packet loss on the host it is running on. The fix in this case is to restart logstash.service on the host in question.
Stats
Documents and bytes counts
The Elasticsearch cat API provides a simple way to extract general statistics about log storage, e.g. total logs and bytes (not including replication)
logstash1010:~$ curl -s 'localhost:9200/_cat/indices?v&bytes=b' | awk '/logstash-/ {b+=$10; d+=$7} END {print d; print b}'
Or logs per day (change $7 to $10 to get bytes, sans replication):
logstash1010:~$ curl -s 'localhost:9200/_cat/indices?v&bytes=b' | awk '/logstash-/ { gsub(/logstash-[^0-9]*/, "", $3); sum[$3] += $7 } END { for (i in sum) print i, sum[i] }' | sort
Or logs per month:
logstash1010:~$ curl -s 'localhost:9200/_cat/indices?v&bytes=b' | awk '/logstash-/ { gsub(/logstash-[^0-9]*/, "", $3); gsub(/\.[0-9][0-9]$/, "", $3); sum[$3] += $7 } END { for (i in sum) print i, sum[i] }' | sort
Data Retention
Logs are retained in Logstash for a maximum of 90 days by default in accordance with our Privacy Policy and Data Retention Guidelines.
Longer Retention
In order for logs to be retained for longer than 90 days:
- Create a request to increase retention for a log stream on Phabricator, tagging the Observability team.
- Audit the log stream to determine that it contains only non-personal information not associated with a user account. The Observability, Security, and Legal teams will review the request.
- Data including but not limited to usernames, email addresses, IP addresses, user agent data, TLS data, cookie data, and specific location data must be removed.
- Once the audit is complete and approved, the log stream will be tagged for inclusion into long-term retention indexes according to cluster capacity.
Common Logging Schema fields indicating PII
These fields (as of ECS 1.7.0) have been identified as likely containing personal information.
Note (on the labels, message, and log.original fields): the code paths populating these fields will need to be audited and demonstrate that personal and/or nonpublic information cannot be inadvertently written to these fields.
- client.*
- error.message
- geo.*
- http.request.body.content
- http.response.body.content
- http.headers.*
- labels
- log.original
- message
- source.*
- tls.client.*
- user.*
- user_agent.*
See also
- mw:Manual:Structured logging (MediaWiki part of the job to feed into Logstash)
- Logs#mw-log (the old method of viewing logs)
- Introducing Phatality (a Kibana plugin to streamline the process of reporting production errors on Phabricator)
- Kubernetes/Logging How logs flow into Logstash from the Kubernetes components