Parsing log files in Python

Log File Analysis

Logs contain very detailed information about events happening on computers, and the extra detail they provide comes with additional complexity that we need to handle ourselves. For example, a single pageview may generate many log lines, and a session can consist of several pageviews.

Another important characteristic of log files is that they are usually not just big. They are massive.

So we also need to cater for their large size, as well as their rapid rate of change.

>>> import advertools as adv
>>> import pandas as pd
>>> adv.logs_to_df(log_file='access.log',
...                output_file='access_logs.parquet',
...                errors_file='log_errors.csv',
...                log_format='common',
...                fields=None)
>>> logs_df = pd.read_parquet('access_logs.parquet')

How to run the logs_to_df() function:

  • log_file : The path to the log file you want to analyze.
  • output_file : The path where you want the parsed and compressed file to be saved. Only the parquet format is supported.
  • errors_file : You will almost certainly have log lines that don't conform to the format you chose, so all lines that weren't properly parsed go to this file. It also contains the error messages, so you know what went wrong and how you might fix it. In some cases, you might simply take these "errors" and parse them again. They might not really be errors, but lines in a different format, or temporary debug messages.
  • log_format : The format in which your logs were written. Logs can be (and are) formatted in many ways, and there is no right or wrong way. However, there are defaults, and a few popular formats that most servers use, so it is likely that your file is in one of them. This parameter can take any one of the pre-defined formats, for example "common" or "combined", or a regular expression that you provide. This means you can parse any log format (as long as entries are single lines, and not formatted in JSON).
  • fields : If you selected one of the supported formats, there is no need to provide a value for this parameter. You have to provide a list of fields if you provide a custom (regex) format; see the sketch after this list. The fields become the names of the columns of the resulting DataFrame, so you can distinguish between them (client, time, status code, response size, etc.).
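To illustrate the custom-format option, here is a minimal sketch that parses a hypothetical pipe-delimited log with a user-supplied regular expression. The regex, file names, and field names are all invented for this example:

import advertools as adv

# A hypothetical pipe-delimited log line:
# 2022-02-16 00:18:53 | /home | 200
adv.logs_to_df(log_file='pipe_delimited.log',
               output_file='pipe_logs.parquet',
               errors_file='pipe_errors.csv',
               log_format=r'^(.+?) \| (.+?) \| (\d+)$',
               fields=['datetime', 'request', 'status'])

Each capture group in the regex maps, in order, to a name in fields, and those names become the columns of the resulting DataFrame.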

Supported Log Formats

  • common
  • combined (a.k.a. "extended")
  • common_with_vhost
  • nginx_error
  • apache_error

Log File Analysis — Data Preparation

Let's go through an example where we prepare the data for analysis. Here is the plan:

  1. Parse the log file into a DataFrame saved to disk with a .parquet extension. A side effect is that your log file is also compressed down to 5-15% of its original size, which makes it super efficient to query and analyze. Function used: logs_to_df.
  2. Convert data types as needed (optional): most importantly, converting the datetime column into a datetime object helps a lot in querying the data. Other possibilities include converting to categorical data types for more efficient storage and querying (see the sketch after the parsing code below). Function used: pandas.to_datetime.
  3. Get the hostnames of the IP addresses of the clients sending requests. We can then easily add a hostname column to the original DataFrame. Function used: reverse_dns_lookup.
  4. Parse and split URL columns into their respective components. Typically we have request, which is the resource/URL requested, as well as referer, which shows us where the request was referred from. Function used: url_to_df.
  5. Parse user agents if available. This allows us to analyze by user-agent family, operating system, bot/non-bot, version, and any other combination we want.
  6. Combine all the data, save it back to a new .parquet file, and start analyzing.
Here is a sample from the log file we will parse:

66.249.73.72 - - [16/Feb/2022:00:18:53 +0000] "GET / HTTP/1.1" 200 1095 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
109.237.103.118 - - [16/Feb/2022:00:20:39 +0000] "GET /.env HTTP/1.1" 404 209 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.12.223.214 - - [16/Feb/2022:00:23:45 +0000] "GET / HTTP/1.0" 200 2240 "http://adver.tools/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
51.68.77.249 - - [16/Feb/2022:00:26:23 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "advertools/0.13.0"
51.68.77.249 - - [16/Feb/2022:00:26:23 +0000] "HEAD / HTTP/1.1" 200 0 "-" "advertools/0.13.0"
192.241.211.176 - - [16/Feb/2022:00:31:16 +0000] "GET /login HTTP/1.1" 404 209 "-" "Mozilla/5.0 zgrab/0.x"
66.249.73.69 - - [16/Feb/2022:00:48:56 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.72 - - [16/Feb/2022:00:48:56 +0000] "GET /staging/urlytics/ HTTP/1.1" 200 520 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.75 - - [16/Feb/2022:00:49:38 +0000] "GET /staging/urlytics/_dash-component-suites/dash/html/dash_html_components.v2_0_0m1638886228.min.js HTTP/1.1" 200 154258 "http://www.adver.tools/staging/urlytics/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36"
66.249.73.75 - - [16/Feb/2022:00:49:39 +0000] "GET /staging/urlytics/_dash-layout HTTP/1.1" 200 2547 "http://www.adver.tools/staging/urlytics/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36"
Parse the log file into a DataFrame saved as a parquet file:

import advertools as adv
import pandas as pd
from ua_parser import user_agent_parser

pd.options.display.max_columns = None

adv.logs_to_df(log_file='data/sample_log.log',
               output_file='data/adv_logs.parquet',
               errors_file='data/adv_errors.txt',
               log_format='combined')

Read the parquet file into a pandas DataFrame, and convert the datetime column into a datetime object.

logs_df = pd.read_parquet('data/adv_logs.parquet')
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'],
                                     format='%d/%b/%Y:%H:%M:%S %z')
logs_df
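The plan's second step also mentions categorical data types. A minimal optional sketch, assuming the "combined" format produced method and status columns (check your own DataFrame's columns first):

# Low-cardinality columns are stored and filtered more efficiently
# as pandas categories. The column names here are assumptions.
logs_df['method'] = logs_df['method'].astype('category')
logs_df['status'] = logs_df['status'].astype('category')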

Perform a reverse DNS lookup on the IP addresses in the client column:

%%time
host_df = adv.reverse_dns_lookup(logs_df['client'])
print(f'Rows, columns: {host_df.shape}')
host_df.head(15)
# Rows, columns: (1210, 9)
# CPU times: user 745 ms, sys: 729 ms, total: 1.47 s
# Wall time: 21.1 s
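The plan's remaining steps (adding the hostnames to the log rows, splitting URLs, parsing user agents, and combining everything) are not shown above, so here is a hedged sketch of one way to finish. The column names used ('ip_address' and 'hostname' from reverse_dns_lookup; 'client', 'request', and 'user_agent' from the "combined" log format) are assumptions to verify against your own output:

# Step 3 (continued): add a hostname column to the original DataFrame.
logs_df = logs_df.merge(host_df[['ip_address', 'hostname']],
                        left_on='client', right_on='ip_address',
                        how='left')

# Step 4: split the requested URLs into their components.
request_df = adv.url_to_df(logs_df['request']).add_prefix('request_')

# Step 5: parse user agents into family, OS, device, etc.
ua_df = pd.json_normalize(
    [user_agent_parser.Parse(ua) for ua in logs_df['user_agent']])

# Step 6: combine everything and save to a new parquet file.
final_df = pd.concat([logs_df.reset_index(drop=True),
                      request_df.reset_index(drop=True),
                      ua_df], axis=1)
final_df.to_parquet('data/adv_logs_final.parquet', index=False)

With everything in one DataFrame, you can slice the logs by status code, hostname, URL directory, or user-agent family, as the plan describes.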


Parse a Log File in Python


A log file contains information about the events that happen while a software system or application is running. These events include errors, requests made by users, bugs, etc. Developers can scan these details to figure out potential problems with the system, implement newer and better solutions, and improve the overall design. Log files can also reveal a lot about a system's security, which helps developers improve the system or the application.

Generally, entries inside a log file follow a format or pattern. For example, a software system can have a format that prints three things: the timestamp, the log message, and the message type. These formats can hold any amount of information, structured as well-formatted text for readability and easier management.
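As an aside, such structured entries are usually produced by a logging framework. The snippet below is an invented example showing how Python's standard logging module could emit pipe-delimited entries like the ones discussed in this article; the format string is this article's convention, not a logging default:

import logging

# Pipe-delimited entries: timestamp | logger name | type | message
logging.basicConfig(
    format='%(asctime)s | %(name)s | %(levelname)s | %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    level=logging.INFO,
)
logging.info('Server started')
# Example output: 2021-10-26 10:26:44 | root | INFO | Server started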

To perform analysis over these log files, one can consider any programming language. But this article will specifically talk about how one can parse such log files using Python. Nevertheless, the theory behind the process remains the same for all programming languages. One can easily translate the Python code to any other programming language to perform the required task.

Parse a Log File in Python

As mentioned above, entries inside a log file have a specific format. This means we can leverage this format to parse the information written inside a log file line by line. Let us try and understand this using an example.

Consider the following log format used for a web application. It has four significant details, namely: the date and time or timestamp (in yyyy-mm-dd hh:mm:ss format), the URL accessed, the type of log message (success, error, etc.), and the log message itself.

DateTime | URL | Log-Type | Log 

Now, consider a file log.txt that contains logs in the format mentioned above. The log.txt file would look something like this.

2021-10-26 10:26:44 | https://website.com/home | SUCCESS | Message
2021-10-26 10:26:54 | https://website.com/about | SUCCESS | Message
2021-10-26 10:27:01 | https://website.com/page | ERROR | Message
2021-10-26 10:27:03 | https://website.com/user/me | SUCCESS | Message
2021-10-26 10:27:04 | https://website.com/settings/ | ERROR | Message
...

The following Python code reads this log file and stores the information in a list of dictionaries. A variable order stores all the dictionary keys in the same order as the fields of a single log entry. Since the log format uses | as a delimiter, we can use it to split each log line into its elements and store them however we like.

import json

file_name = "log.txt"
file = open(file_name, "r")

data = []
order = ["date", "url", "type", "message"]

for line in file.readlines():
    # Split each line on the delimiter and strip surrounding whitespace.
    details = line.split("|")
    details = [x.strip() for x in details]
    # Map the ordered keys onto the values from this line.
    structure = {key: value for key, value in zip(order, details)}
    data.append(structure)

file.close()

for entry in data:
    print(json.dumps(entry, indent=4))
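Running this against the sample log.txt above should print each entry as a JSON object, along these lines:

{
    "date": "2021-10-26 10:26:44",
    "url": "https://website.com/home",
    "type": "SUCCESS",
    "message": "Message"
}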
