
Match regexp in python – I’m building logsplitter

October 7, 2009 | python
author: Karol Zielinski | comments: 2 | views: 2270

Regular expressions are one of the most important things to learn in professional web development. Today I will show how regular expressions work in Python, using my own script called logsplitter to show how they behave in practice.

First: what is a regular expression?
“In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. A regular expression (often shortened to regex or regexp) is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.”

You can find a bit more information about regexps in Python here.
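
As a quick illustration of the definition above, here is a minimal sketch using Python's standard re module. The sample text and the pattern are made up for this example and are not part of logsplitter.

```python
import re

# Hypothetical sample text, not taken from the real logs
text = 'session tto9g started at 08:14:52'

# \d{2} means "exactly two digits"; each pair of parentheses captures one part
time_re = re.compile(r'(\d{2}):(\d{2}):(\d{2})')

# search() scans the whole string, match() only tries at the very beginning
found = time_re.search(text)
hours, minutes, seconds = found.groups()
not_found = time_re.match(text)

print(found.group(0))   # 08:14:52
print(not_found)        # None, because the text does not start with a time
```

The difference between search() and match() matters later: logsplitter uses match(), so every log line is tested from its first character.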

And now… my simple example. I will present a really simple logsplitter (I use it to split logs into separate files based on a variable called session).

How does it work?

I have one file called “my_logs.log”, which contains all the logs from my application. The log lines look, for example, like this:

2009-10-07 08:15:09,536,536 INFO [my_project.file at line 526] [2009-10-07_08-14-52_tto9g] MY LOGGED INFO

or

2009-10-07 08:15:09,536,536 ERROR [my_project.file at line 526] [2009-10-07_08-14-52_tto9g] MY LOGGED ERROR

where:

  • 2009-10-07 is a date
  • INFO / ERROR is a message level
  • my_project.file at line 526 is the path to the file and the line in the code where the logger was called
  • 2009-10-07_08-14-52_tto9g is my unique session
  • MY LOGGED INFO / MY LOGGED ERROR is my logged information
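
To make the list above concrete, here is a small sketch that runs the pattern used later in the script against one of the sample lines. The pattern is the one from the script, only written as a raw string and split over two lines; the variable names sample, fields and continuation are just for this illustration.

```python
import re

# Same pattern as in the script below, written as a raw string
logline_re = re.compile(
    r"^(?P<date>.*?) (.*?) (?P<level>.*?) "
    r"\[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$"
)

sample = ('2009-10-07 08:15:09,536,536 ERROR '
          '[my_project.file at line 526] '
          '[2009-10-07_08-14-52_tto9g] MY LOGGED ERROR')

match = logline_re.match(sample)
fields = match.groupdict()   # only the named groups end up here

print(fields['date'])        # 2009-10-07
print(fields['level'])       # ERROR
print(fields['path'])        # my_project.file at line 526
print(fields['session'])     # 2009-10-07_08-14-52_tto9g
print(fields['message'])     # MY LOGGED ERROR
print(match.group(2))        # 08:15:09,536,536 (the unnamed time group)

# A line without the bracketed parts (e.g. part of a traceback) does not match
continuation = logline_re.match('Traceback (most recent call last):')
print(continuation)          # None
```

The non-greedy `.*?` groups stop at the first space that lets the rest of the pattern match, which is why the date and the time end up in separate groups.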

So let's write some code:

vim testregexp.py

and:


#!/usr/bin/env python

import sys
import os
import os.path
import shutil
import re

file_to_split = 'my_logs.log'
moved_file_to_split = 'my_logs-moved.log'
logs_dir = '/opt/my_project/log/'

logfile_path = logs_dir + file_to_split
moved_logfile_path = logs_dir + moved_file_to_split
logs_splitted_dir = logs_dir + 'splitted/'

# raw string, so that \[ and \] reach the regexp engine unchanged
logline_re = re.compile(r"^(?P<date>.*?) (.*?) (?P<level>.*?) \[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$")

# print(...) with a single argument works in both Python 2 and Python 3
if not os.path.exists(logs_dir):
    print('\nERROR :: I can\'t find logs_dir ' + str(logs_dir) + '\n')
    sys.exit(1)

if not os.path.exists(logfile_path):
    print('\nERROR :: I can\'t find logfile_path ' + str(logfile_path) + '\n')
    sys.exit(1)

# create folder for splitted logs if it does not exist
if not os.path.exists(logs_splitted_dir):
    os.mkdir(logs_splitted_dir)

# check existing splitted log files
present_logs = os.listdir(logs_splitted_dir)
present_logs_by_session = []
for present_log in present_logs:
    present_logs_by_session.append(present_log.replace('_ERROR', ''))

# move log file to other file
shutil.move(logfile_path, moved_logfile_path)

logfile = open(moved_logfile_path, 'r')
logfile_content = logfile.read()
logfile.close()

splitted_logs_by_session = {}

summary_data_error = 0
summary_data_warning = 0
summary_data_info = 0
summary_data_sessions = 0
summary_data_all_lines = 0
summary_data_doesnt_match = 0
summary_data_match = 0

error_sessions = []

prev_session = None
# parse main log file
# splitlines() does not yield an extra empty string after the final newline
for line in logfile_content.splitlines():
    summary_data_all_lines += 1

    match = logline_re.match(line)
    if match:
        summary_data_match += 1

        line_session = match.group('session')
        line_date = match.group('date')
        line_level = match.group('level')
        line_path = match.group('path')
        line_message = match.group('message')

        if line_level == 'ERROR':
            summary_data_error += 1
            if line_session not in error_sessions:
                error_sessions.append(line_session)

        elif (line_level == 'WARNING' or line_level == 'WARNI'):
            summary_data_warning += 1

        elif line_level == 'INFO':
            summary_data_info += 1

        session_logfile_path = logs_splitted_dir + line_session
        session_logfile_error_path = logs_splitted_dir + line_session + '_ERROR'

        # first line seen for this session in the current run
        if line_session not in splitted_logs_by_session:
            summary_data_sessions += 1
            splitted_logs_by_session[line_session] = []

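            if line_session in present_logs_by_session:
                session_logfile = None
                # an earlier run may have stored this session with the _ERROR suffix,
                # so look for that file first, whatever the level of the current line
                if os.path.exists(session_logfile_error_path):
                    session_logfile = open(session_logfile_error_path, 'r')
                    if line_session not in error_sessions:
                        error_sessions.append(line_session)
                elif os.path.exists(session_logfile_path):
                    session_logfile = open(session_logfile_path, 'r')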

                if session_logfile:
                    session_logfile_content = session_logfile.read()
                    session_logfile.close()

                    for session_line in session_logfile_content.split('\n'):
                        if not session_line:
                            continue

                        splitted_logs_by_session[line_session].append(session_line)

        splitted_logs_by_session[line_session].append(line)
        prev_session = line_session

    else:
        if not prev_session:
            continue

        summary_data_doesnt_match += 1
        splitted_logs_by_session[prev_session].append(line)

# create splitted log files
for session_id, session_message_list in splitted_logs_by_session.items():
    message = ''
    for each_message_in_session in session_message_list:
        message += each_message_in_session
        message += '\n'

    session_logfile_path = logs_splitted_dir + session_id
    session_logfile_error_path = logs_splitted_dir + session_id + '_ERROR'

    # remove both possible old files, so no stale copy is left behind
    if os.path.exists(session_logfile_path):
        os.remove(session_logfile_path)

    if os.path.exists(session_logfile_error_path):
        os.remove(session_logfile_error_path)

    if session_id in error_sessions:
        session_logfile_path = session_logfile_error_path

    session_logfile = open(session_logfile_path, 'w')
    session_logfile.write(message)
    session_logfile.close()

print('\nSTATUS: Everything\'s fine.\n')
print('----------- SUMMARY START -----------' + '\n')
print('All lines to parse in log file: ' + str(summary_data_all_lines))
print('All lines that match our regexp: ' + str(summary_data_match))
print('All lines that don\'t match our regexp: ' + str(summary_data_doesnt_match) + '\n')
print('Number of parsed sessions: ' + str(summary_data_sessions) + '\n')
print('Number of info messages: ' + str(summary_data_info))
print('Number of warning messages: ' + str(summary_data_warning))
print('Number of error messages: ' + str(summary_data_error) + '\n')
print('----------- SUMMARY END -----------')
print('\n')

sys.exit()

Now we just need to run our script:

python testregexp.py
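
The core idea of the script, stripped of all file handling, can be sketched as a single function: lines that match the regexp start or continue a session, and lines that do not match (for example traceback lines) are attached to the session of the previous matching line. The function name group_by_session and the sample lines are hypothetical and only serve this illustration.

```python
import re

logline_re = re.compile(
    r"^(?P<date>.*?) (.*?) (?P<level>.*?) "
    r"\[(?P<path>.*)\] \[(?P<session>.*)\] (?P<message>.*?)$"
)


def group_by_session(lines):
    """Collect log lines per session; unmatched lines follow the previous session."""
    sessions = {}
    prev_session = None
    for line in lines:
        match = logline_re.match(line)
        if match:
            prev_session = match.group('session')
        elif prev_session is None:
            # nothing to attach this line to yet
            continue
        sessions.setdefault(prev_session, []).append(line)
    return sessions


# Made-up sample lines in the format described earlier
sample_lines = [
    '2009-10-07 08:15:09,536,536 INFO [my_project.file at line 526] [session_a] first',
    '2009-10-07 08:15:10,100,100 ERROR [my_project.file at line 530] [session_b] failed',
    'Traceback (most recent call last):',
    '2009-10-07 08:15:11,200,200 INFO [my_project.file at line 526] [session_a] second',
]

grouped = group_by_session(sample_lines)
print(sorted(grouped))             # ['session_a', 'session_b']
print(len(grouped['session_b']))   # 2, the ERROR line plus the traceback line
```

Keeping the grouping separate from reading and writing files also makes it easy to try the regexp on a few lines before running it against a real log.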

Comments (2)

barszcz
October 8, 2009, 8:38 pm

Hmm… I swear that your code is somehow very familiar to me ;)

Karol Zielinski
October 9, 2009, 1:32 am

Sure it is. This logsplitter is based on your code in many places. :)


Karol Zielinski :: Just a tech stuff Hello, I'm Karol Zielinski, internet evangelist, an entrepreneur, project manager and a web developer from Gdynia, Poland. I like creative design, good advertisement, social media and all kind of stuff around the web.
