How LogJoint parses text files

Log as string

LogJoint treats a textual log file as one big string. This is a logical representation; of course, LogJoint doesn't physically load the whole file into a string in memory. A string here means a sequence of Unicode characters. To convert a raw log file to Unicode characters, LogJoint uses the encoding specified in your format's settings.
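The idea of a logical string that is never fully loaded can be illustrated with a minimal sketch. This is not LogJoint's actual reader: the class name DecodeSketch, the method ReadChunks and the chunk size are assumptions made for illustration. It only shows how raw bytes can be decoded with a chosen encoding piece by piece.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of LogJoint.
static class DecodeSketch
{
    // Lazily decodes raw bytes into Unicode text, one chunk at a time,
    // so the whole file never has to sit in memory as a single string.
    public static IEnumerable<string> ReadChunks(Stream raw, Encoding encoding, int chunkSize = 4096)
    {
        // Arguments: no BOM detection, buffer size, leave the stream open.
        using (var reader = new StreamReader(raw, encoding, false, chunkSize, true))
        {
            var buffer = new char[chunkSize];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                yield return new string(buffer, 0, read);
        }
    }
}
```

Concatenating the chunks, for example with string.Concat(DecodeSketch.ReadChunks(stream, Encoding.Unicode)), gives back the logical string the rest of this article talks about.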

Suppose we have this log file:

i 2010/3/1 13:30:23 Msg 1
w 2010/3/1 13:30:24 Msg 2

It contains two messages, one message per line. Each line starts with a severity mark (i - information, w - warning), followed by a date and time stamp. The rest of the message is free text with no fixed structure.

The log would be represented by the following string:

i 2010/3/1 13:30:23 Msg 1↵w 2010/3/1 13:30:24 Msg 2

The symbol ↵ represents the newline character.

Header regular expression

LogJoint uses regular expressions to find and parse messages in the string. The first and most important one is the header regular expression. It is supposed to match the beginnings (or headers) of messages. LogJoint takes advantage of the fact that log messages usually have easily recognizable or even fixed headers. In our example the header of a message is the severity mark followed by the date/time information. Each message starts on a new line. The header regular expression may look like this:

^             # every message starts on a new line
(?<sev>       # start of the "sev" capture
  [iwe]       # severity mark: i, w or e
)             # end of capture
\s            # space between severity mark and date/time
(?<date>      # start of the "date" capture
  \d{4}       # 4-digit year
  \/          # slash separating year and month
  \d{1,2}     # one or two digits representing month
  \/          # slash separating month and day
  \d{1,2}     # one or two digits representing the day
  \s          # space between date and time
  \d{2}       # two-digit hour
  \:          # time separator
  \d{2}       # two-digit minutes
  \:          # time separator
  \d{2}       # two-digit seconds
)             # end of capture

Note that LogJoint ignores unescaped white space in patterns and treats everything after # as a comment. This regex captures two named values: sev, the severity of the message, and date, the date/time information. The need for these captures is described later. The ^ at the beginning of the regex matches the beginning of any line in the source string, not just the beginning of the entire string. Programmers can read about the IgnorePatternWhitespace, ExplicitCapture, and Multiline flags that are used here in MSDN: RegexOptions Enumeration.
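For readers who want to try these flags, here is a small sketch that builds the header pattern with plain .NET Regex. It is an illustration only, not LogJoint's own code: the class name HeaderRegexSketch and the multiline switch are assumptions, and the date part of the pattern is condensed onto one line.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LogJoint.
static class HeaderRegexSketch
{
    // The severity class is written as [iwe]: inside a character class
    // the | symbol would be a literal character, not an alternation.
    const string Pattern = @"
        ^                 # every message starts on a new line
        (?<sev>[iwe])     # severity mark
        \s                # space between severity mark and date/time
        (?<date>\d{4}/\d{1,2}/\d{1,2}\s\d{2}:\d{2}:\d{2})   # date and time
    ";

    public static Regex Create(bool multiline = true)
    {
        // IgnorePatternWhitespace allows the layout and # comments above;
        // ExplicitCapture keeps only the named groups.
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;
        if (multiline)
            options |= RegexOptions.Multiline;   // ^ matches at every line start
        return new Regex(Pattern, options);
    }
}
```

Running HeaderRegexSketch.Create().Matches(log) on the two-line sample finds both headers. With multiline set to false, ^ anchors only to the start of the whole string, so just the first header is found.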

LogJoint applies the header regular expression repeatedly to find all the messages in the input string. In our example the header regex matches twice and yields two sets of captures:

Match 1: sev = "i", date = "2010/3/1 13:30:23"
Match 2: sev = "w", date = "2010/3/1 13:30:24"

[Figure: the log string with both header matches highlighted; thick black lines show message boundaries.]

After applying the header regex LogJoint knows where the messages begin and where they end. A message ends where the next message begins.
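The boundary rule (a message ends where the next one begins) can be sketched in a few lines. The names MessageSplitterSketch and FindMessages are hypothetical, and the single-line header pattern is a condensed form of the one shown earlier. LogJoint's real implementation works on a stream rather than on one in-memory string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LogJoint.
static class MessageSplitterSketch
{
    static readonly Regex Header = new Regex(
        @"^(?<sev>[iwe])\s(?<date>\d{4}/\d{1,2}/\d{1,2}\s\d{2}:\d{2}:\d{2})",
        RegexOptions.ExplicitCapture | RegexOptions.Multiline);

    // Returns the character range of every message; End is exclusive.
    public static List<(int Start, int End)> FindMessages(string log)
    {
        var ranges = new List<(int Start, int End)>();
        MatchCollection headers = Header.Matches(log);
        for (int i = 0; i < headers.Count; i++)
        {
            int start = headers[i].Index;
            // The last message runs to the end of the string.
            int end = i + 1 < headers.Count ? headers[i + 1].Index : log.Length;
            ranges.Add((start, end));
        }
        return ranges;
    }
}
```

Text that precedes the first header belongs to no message in this sketch; how LogJoint itself treats such text is not covered here.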

Body regular expression

The next step is parsing the content of each message. LogJoint uses the body regular expression for that. The body regex is supposed to parse (break down into captures) the part of the message that follows the header. In our example the body regex is applied to the text between the end of each header and the start of the next message: " Msg 1" followed by a newline for the first message, and " Msg 2" for the second.

In the example the actual message content doesn't have any structure, so there are no special fields for the body regex to parse. The body regular expression would look like this:

^              # align to the beginning of the message's body (i.e. to the end of the header)
(?<body>       # begin a capture
  .*           # match everything without any parsing
)              # end a capture
$              # align to the end of the body (i.e. to the beginning of the next message)

This regex captures the whole input substring into the capture named body. The need for this capture is explained below. The body regex as specified above can actually be omitted altogether: LogJoint assumes it by default. It is important that in body regular expressions the meaning of ^ and $ is different from header regexps. Here they match the beginning and the end of the entire body substring. With the Singleline option, . also matches newline characters, so .* covers a body that spans several lines. You can read more in MSDN: Singleline regex option (RegexOptions).
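The behaviour of the default body regex can be checked with plain .NET Regex. This is a sketch under the assumption that the pattern is run with Singleline and without Multiline, as the text describes. The class name BodyRegexSketch is hypothetical, and any trimming that LogJoint may apply afterwards is not shown.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LogJoint.
static class BodyRegexSketch
{
    // Without Multiline, ^ and $ anchor to the whole input string.
    // With Singleline, the dot also matches newline characters.
    static readonly Regex Body = new Regex(
        @"^(?<body>.*)$",
        RegexOptions.ExplicitCapture | RegexOptions.Singleline);

    // Returns the "body" capture for the part of a message after its header.
    public static string CaptureBody(string bodySubstring)
    {
        return Body.Match(bodySubstring).Groups["body"].Value;
    }
}
```

Because .* is greedy and the dot matches newlines, a body made of several lines ends up in the single body capture.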

Fields mapping

To summarize: LogJoint uses regular expressions to divide the input string into separate messages and to get a set of named substrings (captures) for each message. The final step is to map this set of substrings to the fields that LogJoint uses to construct a message object. There are predefined fields that LogJoint recognizes and handles in a special way. There may be user-defined fields as well.

Here is the table of predefined fields:

Field name | Type     | Description
-----------|----------|------------
Time       | DateTime | Defines the timestamp of the log message. This field is important for LogJoint to correlate messages from different sources and to allow timeline navigation.
Thread     | String   | Defines the thread identifier of the message. All messages of the same thread have the same background color.
Severity   | Severity | Defines the severity of the message. Severity may be Severity.Info, Severity.Warning or Severity.Error.
Body       | String   | The actual text content of the message.

Any field whose name differs from the names in the table above is user-defined. User-defined fields are automatically appended to the Body field using the "field name"="field value" format.

Don't confuse regex captures with message fields. Captures are raw strings cut out of the log. Message fields are strongly typed; they define the message object that LogJoint works with. When you define a new format you need to provide a way to map the input captures to output fields. This mapping is called fields mapping. Basically it is a table that contains a formula for each output field. Formulas are expressions or pieces of C# code. They use language expressions or function calls to convert regex captures (which are strings) to strongly typed output fields. Internally LogJoint takes the formulas you provide and generates a temporary class. This class is then used in the parsing pipeline.

Here is an example:

Field: Time
Formula type: Expression
Formula:
  TO_DATETIME(date, "yyyy/M/d HH:mm:ss")
Comments: This formula is an expression. It calls the predefined function TO_DATETIME(), passing the date capture as a parameter. The names of all regexp captures are available in the context of the expression. They have string type. TO_DATETIME() returns a value of type DateTime. Expressions must evaluate to a type that is compatible with the field's type.

Field: Body
Formula type: Expression
Formula:
  body
Comments: This formula is simple: it just returns the body capture. Remember that the Body field has String type and so does the body capture.

Field: Severity
Formula type: Function
Formula:
  switch (sev)
  {
  case "w":
    return Severity.Warning;
  case "e":
    return Severity.Error;
  default:
    return Severity.Info;
  }
Comments: This formula is a function. The difference between expressions and functions is that a function may contain any sequence of statements and must return a value (with a return statement). An expression formula consists of a single expression and no statements. Expressions are shorter and simpler but somewhat limited. In formulas of type Function you are free to implement any logic.

All fields except Time are optional. If you don't provide a formula for the Thread field, LogJoint considers all messages to belong to the same thread. The default severity is Info.
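To make the mapping step more concrete, here is a rough sketch of what the three example formulas compute when written as ordinary C#. It does not reproduce the class LogJoint generates. The Severity enum, the FieldsMappingSketch class and the Map method are stand-ins invented for illustration, and TO_DATETIME is assumed to behave like DateTime.ParseExact with the invariant culture.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for LogJoint's Severity type.
enum Severity { Info, Warning, Error }

// Hypothetical helper, not the class LogJoint generates.
static class FieldsMappingSketch
{
    // Assumption: TO_DATETIME parses the capture with an exact format string.
    static DateTime TO_DATETIME(string value, string format)
    {
        return DateTime.ParseExact(value, format, CultureInfo.InvariantCulture);
    }

    // Converts the raw string captures into strongly typed field values.
    public static (DateTime Time, Severity Severity, string Body) Map(string sev, string date, string body)
    {
        // Time: an expression formula.
        DateTime time = TO_DATETIME(date, "yyyy/M/d HH:mm:ss");

        // Severity: a function formula, written here as a switch.
        Severity severity;
        switch (sev)
        {
            case "w": severity = Severity.Warning; break;
            case "e": severity = Severity.Error; break;
            default: severity = Severity.Info; break;
        }

        // Body: the capture is used as is.
        return (time, severity, body);
    }
}
```

For the second sample message, FieldsMappingSketch.Map("w", "2010/3/1 13:30:24", "Msg 2") yields a DateTime of 1 March 2010 13:30:24, Severity.Warning and the body text unchanged.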

Summary

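Here is the overall parsing pipeline:

raw log file -> (encoding) -> Unicode string -> (header regex) -> messages with header captures -> (body regex) -> body captures -> (fields mapping) -> message objects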