Match regexp in python – I’m building logsplitter
October 7, 2009 | python | author: Karol Zielinski | comments: 2
Tags: log, python, regexp, regular expression
Regular expressions are one of the most important things to learn when it comes to professional web development. Today I will show how regular expressions work in Python, using my own script called logsplitter to demonstrate how they work in practice.
First: what is a regular expression?
“In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. A regular expression (often shortened to regex or regexp) is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.”
You can find a bit more information about regexps in Python here.
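As a small illustration of the idea, here is a minimal sketch using Python's built-in re module. The sample text and the pattern are made up for this example and are not part of logsplitter.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "user=anna logged in at 08:15, user=bob logged in at 09:30"

# \w+ matches one or more word characters; the parentheses capture them.
user_re = re.compile(r"user=(\w+)")

# search() finds the first match anywhere in the string.
first = user_re.search(text)
print(first.group(1))  # -> anna

# match() only succeeds if the pattern matches at the very start.
at_start = user_re.match("at 08:15 user=anna")
print(at_start)  # -> None

# findall() returns every captured group found in the string.
all_users = user_re.findall(text)
print(all_users)  # -> ['anna', 'bob']
```

Compiling the pattern once with re.compile() and reusing it is the same approach the script below takes for its log-line pattern.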
And now… my simple example. I will present a really simple logsplitter (I use it to split logs into separate files based on a variable called session).
How does it work?
I have one file called “my_logs.log”, which holds all the logs from my application. The log lines look like this, for example:
2009-10-07 08:15:09,536,536 INFO [my_project.file at line 526] [2009-10-07_08-14-52_tto9g] MY LOGGED INFO
or
2009-10-07 08:15:09,536,536 ERROR [my_project.file at line 526] [2009-10-07_08-14-52_tto9g] MY LOGGED ERROR
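where the fields are, in order: the date and time, the log level (INFO, ERROR), the source location in the first pair of square brackets, the session ID in the second pair, and the logged message.
So let’s create some code: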
vim testregexp.py
and:
#!/usr/bin/env python
import sys
import os
import os.path
import shutil
import re
file_to_split = 'my_logs.log'
moved_file_to_split = 'my_logs-moved.log'
logs_dir = '/opt/my_project/log/'
logfile_path = logs_dir + file_to_split
moved_logfile_path = logs_dir + moved_file_to_split
logs_splitted_dir = logs_dir + 'splitted/'
# raw string, so that \[ and \] reach the regexp engine unchanged
logline_re = re.compile(r"^(?P<date>.*?) (.*?) (?P<level>.*?) \[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$")
if not os.path.exists(logs_dir):
    print('\nERROR :: I can\'t find logs_dir ' + str(logs_dir) + '\n')
    sys.exit(1)
if not os.path.exists(logfile_path):
    print('\nERROR :: I can\'t find logfile_path ' + str(logfile_path) + '\n')
    sys.exit(1)
# create folder for splitted logs if it does not exist
if not os.path.exists(logs_splitted_dir):
    os.mkdir(logs_splitted_dir)
# check existing splitted log files
present_logs = os.listdir(logs_splitted_dir)
present_logs_by_session = []
for present_log in present_logs:
    present_logs_by_session.append(present_log.replace('_ERROR', ''))
# move log file to other file
shutil.move(logfile_path, moved_logfile_path)
logfile = open(moved_logfile_path, 'r')
logfile_content = logfile.read()
logfile.close()
splitted_logs_by_session = {}
summary_data_error = 0
summary_data_warning = 0
summary_data_info = 0
summary_data_sessions = 0
summary_data_all_lines = 0
summary_data_doesnt_match = 0
summary_data_match = 0
error_sessions = []
prev_session = None
# parse main log file
for line in logfile_content.splitlines():
    summary_data_all_lines += 1
    match = logline_re.match(line)
    if match:
        summary_data_match += 1
        # named groups give direct access to each field of the line
        line_session = match.group('session')
        line_date = match.group('date')
        line_level = match.group('level')
        line_path = match.group('path')
        line_message = match.group('message')
        if line_level == 'ERROR':
            summary_data_error += 1
            if line_session not in error_sessions:
                error_sessions.append(line_session)
        elif line_level == 'WARNING' or line_level == 'WARNI':
            summary_data_warning += 1
        elif line_level == 'INFO':
            summary_data_info += 1
        session_logfile_path = logs_splitted_dir + line_session
        session_logfile_error_path = logs_splitted_dir + line_session + '_ERROR'
        if line_session not in splitted_logs_by_session:
            # first line of this session in the current run
            summary_data_sessions += 1
            splitted_logs_by_session[line_session] = []
            if line_session in present_logs_by_session:
                # load lines written by an earlier run, whichever file exists
                session_logfile = None
                if os.path.exists(session_logfile_error_path):
                    session_logfile = open(session_logfile_error_path, 'r')
                    # keep the _ERROR suffix for sessions that already had one
                    if line_session not in error_sessions:
                        error_sessions.append(line_session)
                elif os.path.exists(session_logfile_path):
                    session_logfile = open(session_logfile_path, 'r')
                if session_logfile:
                    session_logfile_content = session_logfile.read()
                    session_logfile.close()
                    for session_line in session_logfile_content.split('\n'):
                        if not session_line:
                            continue
                        splitted_logs_by_session[line_session].append(session_line)
        splitted_logs_by_session[line_session].append(line)
        prev_session = line_session
    else:
        # a line without the expected format continues the previous session
        if not prev_session:
            continue
        summary_data_doesnt_match += 1
        splitted_logs_by_session[prev_session].append(line)
# create splitted log files
for session_id, session_message_list in splitted_logs_by_session.items():
    message = ''
    for each_message_in_session in session_message_list:
        message += each_message_in_session
        message += '\n'
    session_logfile_path = logs_splitted_dir + session_id
    session_logfile_error_path = logs_splitted_dir + session_id + '_ERROR'
    # remove both possible old files, so a session never ends up with two
    if os.path.exists(session_logfile_path):
        os.remove(session_logfile_path)
    if os.path.exists(session_logfile_error_path):
        os.remove(session_logfile_error_path)
    if session_id in error_sessions:
        session_logfile_path = session_logfile_error_path
    session_logfile = open(session_logfile_path, 'w')
    session_logfile.write(message)
    session_logfile.close()
print('\nSTATUS: Everything\'s fine.\n')
print('----------- SUMMARY START -----------' + '\n')
print('All lines to parse in log file: ' + str(summary_data_all_lines))
print('Lines that match our regexp: ' + str(summary_data_match))
print('Lines that don\'t match our regexp: ' + str(summary_data_doesnt_match) + '\n')
print('Number of parsed sessions: ' + str(summary_data_sessions) + '\n')
print('Number of info messages: ' + str(summary_data_info))
print('Number of warning messages: ' + str(summary_data_warning))
print('Number of error messages: ' + str(summary_data_error) + '\n')
print('----------- SUMMARY END -----------')
print('\n')
sys.exit()
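To see what the pattern from the script actually captures, here is a minimal sketch that applies it to one of the sample lines from above. The pattern is the same as in the script, only written as a raw string and split over two lines for readability; the variable names sample and fields are made up for this illustration.

```python
import re

# Same pattern as in the script, written as a raw string.
logline_re = re.compile(
    r"^(?P<date>.*?) (.*?) (?P<level>.*?) "
    r"\[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$"
)

# One of the sample log lines from the article.
sample = ("2009-10-07 08:15:09,536,536 ERROR [my_project.file at line 526] "
          "[2009-10-07_08-14-52_tto9g] MY LOGGED ERROR")

match = logline_re.match(sample)
fields = match.groupdict()  # only the named groups end up in this dict
for name in ("date", "level", "path", "session", "message"):
    print(name + ": " + fields[name])

# The second group has no name; it holds the time part of the timestamp.
time_part = match.group(2)
print("time: " + time_part)
```

Note that the date group stops at the first space, so it holds only 2009-10-07, while the time ends up in the unnamed group.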
Now we just need to run our script:
python testregexp.py
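The core idea of the script, grouping lines by session and attaching lines that do not match to the previous session, can also be tried out without touching any files. The following is a simplified sketch, not a replacement for the script: the function name group_by_session, the session names and the sample lines are all made up for this illustration.

```python
import re

LOGLINE_RE = re.compile(
    r"^(?P<date>.*?) (.*?) (?P<level>.*?) "
    r"\[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$"
)


def group_by_session(lines):
    """Return ({session: [lines]}, set of sessions with an ERROR line)."""
    by_session = {}
    error_sessions = set()
    prev_session = None
    for line in lines:
        match = LOGLINE_RE.match(line)
        if match:
            prev_session = match.group("session")
            if match.group("level") == "ERROR":
                error_sessions.add(prev_session)
        elif prev_session is None:
            # Nothing to attach this line to yet, so skip it.
            continue
        # Non-matching lines fall through and join the previous session.
        by_session.setdefault(prev_session, []).append(line)
    return by_session, error_sessions


# Made-up input: two sessions, one followed by a line in another format.
lines = [
    "2009-10-07 08:15:09,536,536 INFO [a.py at line 1] [sess_a] start",
    "2009-10-07 08:15:10,100,100 ERROR [b.py at line 2] [sess_b] failed",
    "Traceback (most recent call last):",
    "2009-10-07 08:15:11,200,200 INFO [a.py at line 3] [sess_a] done",
]
by_session, error_sessions = group_by_session(lines)
print(sorted(by_session))
print(by_session["sess_b"])
print(sorted(error_sessions))
```

Keeping the grouping in a function that takes a list of lines makes it easy to check the regexp against a few hand-written lines before running the full script on a real log file.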
Hello, I'm Karol Zielinski, an internet evangelist, entrepreneur, project manager and web developer from Gdynia, Poland. I like creative design, good advertising, social media and all kinds of stuff around the web.
October 8, 2009, 8:38 pm
Hmm… I swear that your code is somehow very familiar to me
October 9, 2009, 1:32 am
Sure it is. This logsplitter is based on your code in many places.