Tuesday, July 22, 2014

A Practical Answer to the Question “What to do about Robots?”

Every Tealeaf customer quickly discovers that their data stream contains a large number of robot sessions – web crawlers, scrapers, and internal and external keep-alive monitors. Some of these visitors are polite, identify themselves, and honor your robots.txt file. Others are not so polite, try to stay anonymous, and go anywhere on your site they like. Soon after go-live, the Tealeaf event administrators raise the question: “What do we do about robots in our events?”
I used to belong to the school of thought that said: save everything in Tealeaf. My reasoning was that if a bot started to have a negative effect on the site, I wanted to find it in Tealeaf. That remains an important consideration; over the years, I've traced many site performance degradations, and even complete site outages, to runaway bots. The drawback to saving bot traffic in Tealeaf is the complexity it adds to creating events. Creating a variation on every event to cover totals, bot-only, and non-bot-only is time-consuming. Creating the correct reports from these variations is both time-consuming and often very complex. And trying to communicate to the report consumers (business team executive stakeholders <grin>) the “ifs, ands, and buts” that bots impose on the data – well, let's just say I would rather not have to.
So I wanted to come up with a solution that would remove bot traffic from the canisters so it wouldn't impact business events, yet still keep the traffic available for analysis. I've come up with a solution that takes bots out of the data stream, but saves their pertinent data to TLA files for later analysis if desired. A couple of advantages to this approach:
  • You are going to be shocked by how much RAM, disk, and CPU processing you save once the robot data is out of the data stream. That is offset somewhat by the storage space the TLA files take – but the solution drops response bodies, for an overall reduction in disk consumption.
  • You no longer have to take special care to eliminate robots from your event definitions, nor create special events to recognize robots. Your data stream is now “real users only”, to the extent that you can make it so.
First, one caveat: this post is NOT about how to recognize stealth robots. Later, I'll show how to treat certain IPs as “all traffic from this IP is bot traffic”, but I'm going to assume you are doing some kind of external analysis of your data to identify stealth robots by their IP address.

The following picture shows the pipeline components necessary for doing the Bot elimination. A nice PDF of the drawing is here, and a zip file of the pipeline components that are discussed in this post can be found here.


Before we start, make sure you have the ‘use HTTP_X_FORWARDED_FOR’ radio button checked on the PCA server to ensure that the REMOTE_ADDR is pointing to the true remote IP address, not to the load balancer in your data center. This is covered in the PCA administrator's guide.

Start with the TLTREF session agent (SA). You should already have this in your pipeline, using the browscap.csv file (and maintaining it!). There's no reason for me to recap how to set up TLTREF and browscap; it is well covered in the Tealeaf documentation. This SA creates the [ExtendedUserAgent] section in the request block and, for bots recognized by user-agent, sets TLT_TRAFFIC_TYPE=BOT.

TLTREF session agent:
[TLTRef]
TypeName=TLTRef
DLL=SessionAgentTLTRef.dll
DownStreamConfigSection=PrivacyStealthRobotByIP
ReferenceMaster=true
UpdateInterval=600
ReferenceFileMaxSize=1000
URLReferenceFile=.\system\Reference_Path.txt
ServerReferenceFile=.\system\Reference_Server.txt
HostReferenceFile=.\system\Reference_Host.txt
AppReferenceFile=.\system\Reference_Application.txt
URLReferenceRules=true
URLReferenceRulesMode=cont
URLReferenceRulesFile=.\system\Reference_Path_Rules.txt
UseFullVirtualDir=False
NormalizeHost=True
NormalizedHostValues=WWW.<YourSiteHere>.COM;
NormalizeServer=True
NormalizeAppName=True
NormalizeURL=True
NormalizeURLExt=ACTION;ASMX;ASP;ASPX;CSS;DO;HTM;HTML;GIF;ICO;JPG;JS;JSP;JHTML;PDF;PHP;SWF;TXT
NormalizeURLStatusCode=0;200;204;206;301;302;303;304;400;402;403;404;410;500;501;502;503;504;505
AllowEmptyExtension=True
NormalizeURLRootDefault=DEFAULTPAGE
OutputReferenceStats=True
StatsKeepDays=30
OutputReferenceStatsMin=1
StatsReportingInterval=60
OutputUserAgentStats=True
ReferenceIDOffset=0
ReferrerParsing=True
ReferrerPrepend=REF_
AdvancedUAParsing=True
UAFilesDir=.\system\
PruningInterval=60
MaxCacheSize=2000
EnableUserAgentLogFile=True
UALog=False
UserAgentLogLevel=1 


Below (downstream of) TLTREF we are going to add the PrivacyStealthRobotByIP session agent, a privacy SA that uses the configuration file PrivacyStealthRobotByIP.cfg. The following is an example of the single rule in this privacy cfg file. As shown, it uses PartOfList tests to compare REMOTE_ADDR against semicolon-separated lists of IP addresses. (You could achieve the same thing with a RegExpr that lists the addresses separated by the ‘|’ character, anchored to the start and end of the REMOTE_ADDR value buffer.) Any IP address that matches sets the request's TLT_TRAFFIC_TYPE to BOT, and the TLT_BROWSER to ‘Stealthbot-<IPaddress>’, where <IPaddress> is the value of REMOTE_ADDR.

PrivacyStealthRobotByIP session agent:
[PrivacyStealthRobotByIP]
TypeName=PrivacyStealthRobotByIP
DownStreamConfigSection=RTARouteBots
ConfigFile=PrivacyStealthRobotByIP.cfg
DLL=SessionAgentPrivacy.dll
LogLevel=Error
LogRules=True

PrivacyStealthRobotByIP.cfg
[Rule1]
StopProcessing=True
Enabled=True
Tests=T_1,T_2
Actions=A_Traffic_Type, A_Browser

[T_1]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=192.168.255.255;10.10.10.255

[T_2]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=216.106.218.118;209.85.238.2;209.85.238.9

[A_Traffic_Type]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_TRAFFIC_TYPE
ReplaceString=BOT

[A_Browser]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_BROWSER
StartPatternRE=REMOTE_ADDR=(.*)
Inclusive=True
ReplaceString=Stealthbot-{g1}
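In ordinary Python terms, the rule behaves roughly like this. A minimal sketch only: the IPs are the sample addresses from the cfg above, and the helper name `tag_hit` is mine, not a Tealeaf API.

```python
import re

# Hypothetical stealth-bot IPs; substitute your own list.
BOT_IPS = ["216.106.218.118", "209.85.238.2", "209.85.238.9"]

# Anchor the alternation so "9.85.238.2" can't match inside "209.85.238.2".
# The dots are escaped because '.' is a regex metacharacter.
BOT_IP_RE = re.compile("^(" + "|".join(re.escape(ip) for ip in BOT_IPS) + ")$")

def tag_hit(request: dict) -> dict:
    """Mimic the two privacy actions: set TLT_TRAFFIC_TYPE and TLT_BROWSER."""
    addr = request.get("REMOTE_ADDR", "")
    if BOT_IP_RE.match(addr):
        request["TLT_TRAFFIC_TYPE"] = "BOT"
        request["TLT_BROWSER"] = "Stealthbot-" + addr
    return request

print(tag_hit({"REMOTE_ADDR": "209.85.238.2"})["TLT_BROWSER"])
# Stealthbot-209.85.238.2
```

The anchoring is the important part: without `^` and `$`, a short IP fragment could match inside a longer, innocent address.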


Now that robots have been identified in the data stream, the next downstream session agent is the RTARouteBots SA. This SA uses TLT_TRAFFIC_TYPE to pick off the bot hits and send them to a child router pipeline. It reads the file ‘Rule Scripts\RTARouteBots.ini’, which has a single rule that sets the pipe ID to ‘1’ if TLT_TRAFFIC_TYPE contains ‘BOT’. In the RTARouteBots session agent, PipelineConfig1 names the file HRCP_RouteBots.cfg. All bot hits are sent to the child pipeline defined in this cfg file, and all non-bot hits continue on down the main pipeline.

RTARouteBots session agent:
[RTARouteBots]
TypeName=RTARouteBots
DLL=SessionAgentPipelineSplitter.dll
DownStreamConfigSection=<YourNextDownstreamSA>
ScriptTrace=OFF
RTAIni=RTARouteBots.ini
ResponseType=All
EnvironmentScript=EngineEnvironment.tcl
PreProcScript=RTA.PreProc.tcl
ActionScript=RTA.Action.tcl
PipelineConfig1=HRCP_RouteBots.cfg

RTARouteBots.ini:
TeaLeaf RTA

<RULE NUM="1" STATUS="ENABLED" DESCRIPTION="Route Bot Traffic Down PipelineID1" STOPRULES="YES">
  <GROUP TYPE="REQTest" REQF="TLT_TRAFFIC_TYPE" REQOP="CONTAINS" SRCHCASE="NO" REQVAL="BOT">
    </GROUP>
  <GROUP TYPE="SetPipeID" PipeID="1" Type="SET">
    </GROUP>
</RULE>
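The routing decision itself is trivial. Here is a sketch in Python, with a simplified dict standing in for the real hit (the function name is mine):

```python
# A sketch of the splitter's routing decision, using a simplified hit dict.
def route_pipe_id(hit: dict) -> int:
    """Return 1 (send to the bot child pipeline) when TLT_TRAFFIC_TYPE
    contains BOT, case-insensitively; 0 means stay on the main pipeline."""
    return 1 if "BOT" in hit.get("TLT_TRAFFIC_TYPE", "").upper() else 0

print(route_pipe_id({"TLT_TRAFFIC_TYPE": "BOT"}))     # 1
print(route_pipe_id({"TLT_TRAFFIC_TYPE": "Mobile"}))  # 0
```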

This ends our changes to the main pipeline. Simply add the two new session agents to your existing pipeline just downstream of the TLTREF session agent, and in the RTARouteBots SA, set DownStreamConfigSection appropriately.

The Hit Router Child Pipeline file HRCP_RouteBots.cfg is derived from a standard child pipeline. It has only a few session agents: Decouple, a privacy SA used to drop hit response bodies, and Archive. Just after Decouple, the privacy SA PrivacyDropBotHitResponseBodies uses the configuration file of the same name, PrivacyDropBotHitResponseBodies.cfg. This file has a single rule that fires on every hit, with an action that removes the response body from the hit. There is almost no reason to keep the pages the robots saw in your archive files, and the savings in file size is usually about 10-fold. The Archive SA defines the archive file location and instructs the archiver to roll the archive file at midnight every day, and also during the day if it exceeds 500MB.

HRCP_RouteBots.cfg:

#This child pipeline is intended to get BOT only traffic from the main pipeline
[StatusMaster]
DLL=StatusMaster.dll
AppName=BotTraffic
[Globals]
DownStreamConfigSection=DecoupleToBotArchive
#DownStreamConfigSection=DecoupleToBotSocket
#DownStreamConfigSection=Null
[DecoupleToBotArchive]
TypeName=DecoupleToBotArchive
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=PrivacyDropBotHitResponseBodies 
[PrivacyDropBotHitResponseBodies]
TypeName=PrivacyDropBotHitResponseBodies
DLL=SessionAgentPrivacy.dll
DownStreamConfigSection=BotArchive
ConfigFile=PrivacyDropBotHitResponseBodies.cfg
LogRules=True
[BotArchive]
TypeName=BotArchive
DLL=SessionAgentArchive.dll
LogDirectory=D:\BotTrafficArchives
FileID=BotTraffic
MaxLogSize=500MB
RollTime=00:01
QuotaPctFree=2
QuotaScanTime=20
QuotaDLL=DiskQuota.dll
############# Below this is for sending to another canister on another server (VM)
[DecoupleToBotSocket]
TypeName=DecoupleToBotSocket
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=BotSocket
[BotSocket]
TypeName=BotSocket
DLL=SessionAgentSocket.dll
Server=<Server_Name_Here>
Port=1966
[Null]
TypeName=Null
DLL=SessionAgentNull.dll

PrivacyDropBotHitResponseBodies.cfg:

[Rule1]
Enabled=True
Actions=A_DropResponse
[A_DropResponse]
Action=DropResponse

That's it! There is a link here for all of these files. Edit your copies of these cfg files for your IP addresses, the location where you want the TLA files, and the RTARouteBots SA's DownStreamConfigSection; drop them into your pipeline, restart the Transport Service, and watch the bots disappear from your canisters and appear in the filesystem as TLA files!

Maintenance (ugh!)
You will need a scheduled task that runs every night and removes bot TLA files older than, say, 14 days (whatever works for you). Type ‘task’ into the Windows search box and Task Scheduler will appear near the top of the results – open it and set up your task there.
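If you'd rather script the cleanup than write it from scratch, something like this Python sketch could be what the scheduled task runs. The folder name and retention period are assumptions – adjust both to your installation:

```python
import os
import time

# Hypothetical settings: adjust the archive folder and retention to taste.
ARCHIVE_DIR = r"D:\BotTrafficArchives"
KEEP_DAYS = 14

def purge_old_tlas(directory: str, keep_days: int = KEEP_DAYS) -> list:
    """Delete .tla files whose last-modified time is older than keep_days.
    Returns the list of deleted file names."""
    cutoff = time.time() - keep_days * 86400
    deleted = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.lower().endswith(".tla") and os.path.isfile(path) \
                and os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted.append(name)
    return deleted
```

Call `purge_old_tlas(ARCHIVE_DIR)` from the nightly task; only `.tla` files are touched, so anything else in the folder is left alone.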

Maintain your browscap.csv file and your list of Stealth IP addresses regularly.
Keep an eye on Bot traffic to ensure there are no real users (orders page URLs) that appear in these files.


Analysis
You can open a .TLA archive file with RTV and analyze the bot traffic that way, but RTV is not very good at summarizing the data or finding the heaviest hitters. I've found Excel pivot tables to be a very powerful analysis tool. If you are not a pivot-table expert, Google it! There are tons of resources for learning, and you will thank yourself for taking the time. I'm going to show you how to convert a TLA into a CSV file ready to be consumed by a pivot table.

For analysis, it is important to know that a TLA file is just a ZIP file. You can open a TLA file with 7-Zip. Each TLTSID produces a subfolder, and within each TLTSID’s subfolder are the .req files for each request. You can open the .req files with any editor, and see a very familiar REQ block.

Writing a PowerShell script to read the TLA files and produce a .CSV summary was a bit persnickety. I've included one in the zip package that does a simple analysis. It works on just one TLA file at a time, and it doesn't take arguments – you'll need to edit the first four lines to set up things like input and output file names. If I get sufficient interest and comments, I'll consider making the script user-friendly and production-ready.
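For the curious, the core of such a summarizer is small. Here is a hedged Python equivalent: the field list and the function name are mine, and it assumes simple name=value lines inside each .req entry (section headers like [ExtendedUserAgent] contain no ‘=’ and are skipped):

```python
import csv
import zipfile

# Fields to pull out of each .req file; these are the ones used for
# pivoting later in the post.
FIELDS = ["TLT_BROWSER", "TLT_TRAFFIC_TYPE", "REMOTE_ADDR", "STATUSCODE"]

def tla_to_csv(tla_path: str, csv_path: str) -> int:
    """Summarize every .req entry in a TLA (a renamed ZIP) into one CSV
    row per hit; returns the number of rows written."""
    rows = 0
    with zipfile.ZipFile(tla_path) as tla, \
            open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["TLTSID"] + FIELDS)
        for entry in tla.namelist():
            if not entry.endswith(".req"):
                continue
            tltsid = entry.split("/")[0]   # the subfolder name is the TLTSID
            pairs = {}
            for line in tla.read(entry).decode("utf-8", "replace").splitlines():
                if "=" in line:
                    key, _, value = line.partition("=")
                    pairs[key.strip()] = value.strip()
            writer.writerow([tltsid] + [pairs.get(f, "") for f in FIELDS])
            rows += 1
    return rows
```

The resulting CSV has one row per hit, which is exactly the shape a pivot table wants.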

The output of the PowerShell script is a .CSV file. Open it with Excel, hit ‘Ctrl-A’ to select everything, then Insert->PivotTable->PivotChart (and take the defaults by clicking ‘OK’).

I've done a fifteen minute video capture showing the kinds of analysis that you can do on this data. The .mp4 file of this video can be found here .
Here is the result, showing how many hits came from each bot over a two-hour period. The default set of properties included in the simple analysis lets you pivot on TLT_BROWSER, TLT_BROWSER_VERSION, URL, TLT_URL, STATUSCODE, REMOTE_ADDR, TLTSID, and any combination of these. You can find out which bot hits you most often, which bots are producing status code 500 errors (and how many), and whether any are hitting pages that aren't in your TLTREF SA's NormalizeURLExt list (look for places where URL=Others). Remember to save the file as an Excel workbook to preserve your pivot table work.



Next Steps:
If anyone extends the PowerShell script with additional features, or comes up with additional analysis tricks for the CSV file, be sure to post and share them in the Tealeaf Users Group on LinkedIn. Ditto for any ideas you have on finding the IP addresses of stealth bots.

I hope this technique helps you write Tealeaf events that are better focused on the business needs, and also gives you a good analysis of your robot traffic. Happy TeaLeaf'ing!











Monday, March 31, 2014

Privacy Rules and blocking PCI data

I recently got a comment asking for clarification of the Tealeaf privacy rules. These are the critically important part of Tealeaf data processing that eliminates or masks personal confidential information (PCI) – things like credit card numbers and passwords. I looked at the IBM FAQs and was surprised to find very little information there. Years ago I wrote a technical explanation of how to block PCI data in a value attribute and posted it on the Tealeaf community site, but it appears that post did not make it from viaTeaLeaf to the IBM site, so I'll rewrite and expand upon it here.

Where PCI Blocking occurs

There are three different places in the Tealeaf systems where you need to make configuration changes to effectively block PCI data: the hit's request block, its response block, and the client-side Client User Interface (CUI) recorded data hit. Blocking PCI data in the request and response is accomplished in the Tealeaf pipeline; the CUI data blocking is done in the CUI/SDK configuration file. In the pipeline, the privacy session agents (Privacy and PrivacyEx) can block or mask PCI data. These two terms are significantly different to information security teams. Blocking PCI data means it is destroyed in the data stream – there is no way to recover it. Masking means encrypting the PCI data in such a way that only authorized users can see it. Blocking is easier to implement, and I'll use the term ‘block’ throughout most of this post. Masking requires more implementation steps, and I'll devote a section to it later in the post.

I always urge clients to implement all the PCI data blocking in the pipeline at the PCA tier – doing it there keeps PCI data off all the downstream servers, and only the PCA servers have to be made PCI compliant and audited by the information security teams at their company.

PCI Blocking in the Pipeline

Privacy rules are implemented in the privacy agents and can be used for much more than just blocking or masking PCI data. Privacy rules usually perform some kind of search-and-replace or extract operation; this blog post focuses narrowly on their use for PCI data. Privacy rules aren't really very hard – they are just search-and-replace patterns. They are implemented in the privacy session agent of the Tealeaf data pipeline and can be put into the pipeline at any tier – PCA, HBR, or Processing server. But for PCI blocking, I strongly urge this be done in the PCA.

The privacy session agent reads the privacy.cfg file for its search-and-replace patterns. Since the PCA servers are Linux boxes and the HBR and Processing servers are Windows boxes, the path to the privacy file will of course differ, but the contents of the file are identical. In its simplest form, it is a list of [rules]. The PCA has a visual GUI for editing its privacy rules; it's instructive to try different privacy rule formulations in the GUI and see how they affect the privacy.cfg file. For this post, though, I'll be old-school and focus on the contents of privacy.cfg itself. You can edit this file with any text editor, and you can put it under source-code control to track version changes.

In a few weeks I hope to post details on how to protect this file from changes, to reduce the possibility that an unscrupulous system admin makes changes to allow harvesting of PCI information.

Blocking the data in the Request block

The first and easiest place you need to block data is in the request block of the hit. Your web application is going to put up a form, and there are going to be <input> tags that define text fields where the users enter passwords, credit card numbers, CVV numbers, new passwords, answers to security questions, old passwords/new passwords, and other PCI data fields. Every HTML input tag has either a name= or an id= attribute. When the page posts (or ‘gets’), the input data, along with the name or id, is passed either as a query parameter or as part of the request body. For privacy you don't have to care whether the page ‘posts’ or ‘gets’ – Tealeaf automagically blocks both. To block specific data fields in the request blocks, you specify a list of input field names (or ids). It takes just one rule and one action block. The rule is always enabled and specifies an action block; that action block specifies the action ‘Block’, the Section ‘urlfield’, and a ValueName containing the pipe-separated list of fields to block. Together, they look like this:

[Rule1]
Enabled=True
Actions=A_TextBlockURLFields

[A_TextBlockURLFields]
Action=Block
Section=urlfield
ValueName=CreditCard|CardNumber|NewPassword|SecurityAnswer1
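As a sketch of what that Block action does to posted form data, here is a hedged Python equivalent. The helper name is mine, and it is deliberately simplified to plain ‘&’-delimited bodies – the real agent also handles query strings and encoding details:

```python
import re

# The same field list as the ValueName above.
BLOCK_FIELDS = ["CreditCard", "CardNumber", "NewPassword", "SecurityAnswer1"]

def block_urlfields(body: str) -> str:
    """Strike out the value of each listed field in url-encoded form data,
    roughly what Action=Block with Section=urlfield does."""
    for field in BLOCK_FIELDS:
        body = re.sub(
            re.escape(field) + r"=([^&]*)",       # value runs to the next '&'
            lambda m, f=field: f + "=" + "X" * len(m.group(1)),
            body,
        )
    return body

print(block_urlfields("CardNumber=4111111111111111&amount=20"))
# CardNumber=XXXXXXXXXXXXXXXX&amount=20
```

Note that the field values are destroyed, not encrypted – which is exactly the block-versus-mask distinction discussed above.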

I’ve managed ValueName lists that are pretty long. Often the web application is developed with .Net or other frameworks, and the framework assigns the control names. .Net applications in particular have very long control names, like ctl00$ContentInfo$CreditCard$CreditCardNumber$txtCCNum and ctl00$ContentInfo$SecurityQuestionAnswer$AnswerReType$txtSecurity. You do have to specify the full name of every field to block (no wildcards).

If the web application has accrued lots of pages and lots of PCI-impacting fields from lots of developers over the years, managing the list of PCI-impacting field names can be a pain. But a little creative Excel spread sheeting makes it possible to manage even very long lists of field names to block. I’ve a couple of Excel formulas that can help – comment on this post if you are interested.

Blocking the data in the Response block

The next place to block PCI data is in the response block. This takes a bit more work, and you have to know your application. In particular, where does your application display credit card numbers, passwords, answers to security questions, etc.? Usually there are far fewer places where your application echoes PCI data back to the browser. But you will need to look at example pages where, for example, a credit card number is displayed, or a list of security questions and answers is displayed. What you need to do is write a rule (or rules) that can match preamble – sensitive data – postamble. The privacy rules support most Regular Expression constructs, so it's pretty easy to write the rules once you have found some example pages. You don't have to match every possible combination in one expression; it's fine to have multiple actions, each matching one or more places where PCI data may appear in the response. Nor do you need to specify the page name (URL) where the PCI data lives – in fact, I never specify a URL when blocking sensitive data, because it's too easy for a URL to change. I always write the rules to look for the data patterns, and let the PCA processors inspect every page that comes through. As long as the PCA CPUs are not breaking a sweat, there's no problem with letting them inspect every page. Modern servers have plenty of CPU cycles – just keep an eye on all the Tealeaf pipeline processes on a PCA during a typical peak busy hour, and keep their CPU utilization under 60% or so, just to be safe.

Back to the regular expressions for data blocking… Earlier we had a rule and an action for blocking the request data. We can simply add another action to the rule for blocking the response data. The action will execute on every hit; again, as long as the PCA's CPUs are not heavily loaded, there's no problem with that. What this particular rule is going to do is block any value attribute of an HTML input tag. Have you ever entered a credit card number on a page, submitted it, made a mistake somewhere on that page, and had the web site helpfully echo the credit card number back in its input field? This is usually accomplished with the value attribute, so blocking the value attribute prevents Tealeaf from recording the PCI data that the web site echoes back in the input field.

The new rule now has two additional comma-separated actions.

[Rule1]
Enabled=True
Actions=A_TextBlockURLFields,A_BlockCCInResponse1,A_BlockCCInResponse2

And the privacy.cfg file has two added action blocks.

[A_BlockCCInResponse1]
Action=Block
Section=response
Field=body
StartPatternRE=(?-s)<[^>]*?\sname\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]
Inclusive=True
BlockingMask=value\s*=\s*["']{0,1}([^"']*)["']{0,1}[\/\s>]

The Tealeaf manual section on privacy rules explains the Action, Section, Field, StartPatternRE, Inclusive, and BlockingMask parameters, so we will focus on just the StartPatternRE and BlockingMask values.

Regular Expressions are your friend!

Regular Expressions, or RegExs, tell a computer how to match a pattern, and a well-written RegEx can be evaluated very efficiently. There are plenty of tutorials on the web for constructing regular expressions, so I'm just going to explain the RegEx above that masks the value attribute.

The string to be found in the response will look like <input OptionalStuffWeCanIgnore name = "FieldName" OptionalStuffWeCanIgnore value = '123456789123456' OptionalStuffWeCanIgnore >. The spaces around the = characters are optional, and the attribute values may be wrapped in either quotes or apostrophes. Developers (and development tools) are free to construct their HTML any way they like, as long as it conforms to the W3C standard, so our RegEx needs to be sophisticated enough to match any standard formulation.

First, the alternation construct is (a|b|c), which says to match “a or b or c”. Our RegEx uses an alternation construct to list all of the input field names, like (password|newpassword|oldpassword|creditcardnumber|cvv). Parentheses also create a numbered match group by default, and match groups are more expensive than plain alternation; when you don't need the group, the non-capturing form (?:a|b|c) is cheaper. In our RegEx, though, the groups are kept because the backreferences described below depend on their numbering. The (?-s) at the start of the pattern is a mode modifier that turns off “dotall” matching: it guarantees that the . character will not match line-break characters, which keeps the pattern from running away across line boundaries.

Next we have character classes: ["'] says to match either the double-quote or the apostrophe character, and the length modifier {0,1} says to match anything in the character class 0 or 1 times. Together, ["']{0,1} matches 0 or 1 quote-or-apostrophe characters. Another character class we use is [^>], which matches any character except >. Two more length modifiers we use are *, which matches the preceding item 0 or more times, and +, which matches the preceding item 1 or more times.

Any character is matched by the . character. Whitespace (spaces or tabs) is matched by the special sequence \s. Greedy matching is the default for a RegEx: given the string “<p><b>”, the pattern <.*> matches the longest possible substring, “<p><b>”. If you want the shortest match, add the ? character after the quantifier, so <.*?> matches just “<p>”.

To match the / character itself, some tools require it to be “escaped”, so \/ matches the / character. (Strictly, / is only special in languages that delimit RegExs with slashes, but escaping it is harmless everywhere.)

Our final construct is the backreference. Within a RegEx, we can refer back to something that matched earlier, if we put what we want to reference inside a () group. Whatever matches within the first pair of () is referenced later in the RegEx as \1, the second pair as \2, and so on. When a value is wrapped in a pair of quotes or a pair of apostrophes, the W3C standard says the same character (quote or apostrophe) has to appear at both the beginning and the end. So (["']{0,1})(a|b|c)\1 matches a or b or c only when it is surrounded by nothing (that's the 0 in {0,1}), by a pair of quote characters, or by a pair of apostrophe characters.

Putting some of these together into short substrings, \s+[^>]*? matches one or more whitespace characters, then any characters that are not >, 0 or more times (non-greedy).

Here is the StartPatternRE, and an explanation of it.

(?-s)<[^>]*?\sname\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]

(?-s)<[^>]*?\sname : With dotall off, the . character won't cross line breaks. Start matching at a < character, lazily skip anything until “whitespace then name”; if a > appears before the string name, this starting position fails and matching moves on.

name\s*=\s* : look for the string name, then 0 or more whitespace characters, the = character, and 0 or more whitespace characters.

(["']{0,1}) : look for either the double-quote character or the apostrophe character, occurring exactly 0 or 1 times. Group this to create a backreference. Since this is the first pair of () characters, this backreference can be referred to as \1.

(CreditCard|CardNumber|NewPassword|SecurityAnswer1) : This is a sample of the alternation that lists the exact field names of your application that need to be blocked. Only fields in which the application will echo PCI data using a value attribute need to be listed, separated by the vertical pipe | character.

\1\s+[^>]*?value : match whatever the first backreference matched, then one or more whitespace characters, then any string of characters that is not the > character (non-greedy, that is, the shortest substring), followed by the string value.

\s*=\s*(["']{0,1}) : 0 or more whitespace characters, the = character, then 0 or more whitespace characters, then either the double-quote character or the apostrophe character, occurring exactly 0 or 1 times. Group this to create a backreference. Since this is the third pair of () characters, this backreference can be referred to as \3.

.*?\3[\/\s>] : any character occurring 0 or more times (non-greedy), then the third backreference. After the third backreference matches, it must be followed by a whitespace character, the > character, or the / character. Closing a tag in HTML can be either > or />, so after the value attribute the input tag either continues with more attributes (whitespace follows the backreference) or closes, which is why we match >, /, or whitespace.

Whew! That was a lot of explanation. I hope you followed all of that, but if not, there are tools to help you visualize how all of this works.

Javascript Regular Expression Engines and Testing tools

There are web sites and online tools to help you construct and test Regular Expressions. You paste in the string to be tested (a cut'n'paste of a web-page snippet that contains the input tag with a value attribute) and the RegEx, run the test, and the tool tells you whether the RegEx matched. Good tools even break down each piece of the RegEx and tell you where it matches the string. But be careful – online tools use different RegEx engines! You need to carefully validate that the online tool you use produces the same result as the PCA.

Here are my two favorite online tools for testing a RegEx.

http://regexpal.com/  This one is good, but basic.

http://myregextester.com/  My personal favorite. In particular, it has an ‘explain’ function that breaks down your RegEx piece by piece and shows you why/how it matches. But watch out – the tool doesn't work well in the Chrome browser. I use IE when I'm on this site.

These test tools are especially good for testing RegExs you might use in an Advanced Mode event. The Event Processing engine is written in JavaScript and uses the Google JavaScript engine. Both of the tools above use a JavaScript engine (I have no idea which one). Make sure the testing tool you select uses a JavaScript engine, because there are a few differences between the .Net engine, the Perl Compatible Regular Expression (PCRE) engine, and JavaScript engines. The biggest difference to watch out for is the behavior of the ‘.’ character in multi-line matches. If you want to match ‘any character’ and the substring crosses a line boundary (CR,LF character pair), the .* construct won't work in JavaScript, but [\S\s]* will. If you don't follow that after studying the RegEx rules, post a comment and I'll go into detail.

The Blocking Mask

The section above described the StartPatternRE portion of the rule, which identifies and isolates an input tag that contains one of the specified name attributes and a value attribute. But we have not yet told the Action what to block. We use the BlockingMask for that. Whatever appears in the first () grouping of the mask is replaced with the StrikeCharacter; the default StrikeCharacter is ‘X’.

BlockingMask=value\s*=\s*["']{0,1}([^"']*)["']{0,1}[\/\s>]

With the blocking mask as specified above, the response block will be modified to become value = “XXXXXXXXXX”
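Putting the StartPatternRE and the BlockingMask together, here is a minimal Python sketch of the masking behavior. The field names are the sample list from the rule above and the helper name is mine; Python's re serves as a stand-in for the PCA's PCRE-style engine (close, but validate against the PCA itself). Python has dotall off by default and does not accept a bare (?-s), so that prefix is dropped:

```python
import re

# Sample field names from the rule above; yours will differ.
FIELDS = "(CreditCard|CardNumber|NewPassword|SecurityAnswer1)"

# StartPatternRE equivalent: group 1 and 3 are the optional quote characters,
# group 2 is the field-name alternation, \1 and \3 are the backreferences.
START = re.compile(
    r"""<[^>]*?\sname\s*=\s*(["']{0,1})""" + FIELDS +
    r"""\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]"""
)
# BlockingMask equivalent: keep the attribute syntax, strike only the value.
MASK = re.compile(r"""(value\s*=\s*["']{0,1})([^"']*)(["']{0,1}[\/\s>])""")

def block_value(html: str) -> str:
    """Replace the value of any sensitive input tag with X's."""
    out, last = [], 0
    for m in START.finditer(html):
        seg = MASK.sub(
            lambda v: v.group(1) + "X" * len(v.group(2)) + v.group(3),
            html[m.start():m.end()],
        )
        out += [html[last:m.start()], seg]
        last = m.end()
    return "".join(out) + html[last:]

page = '<input type="text" name="CreditCard" value="4111111111111111" />'
print(block_value(page))
# <input type="text" name="CreditCard" value="XXXXXXXXXXXXXXXX" />
```

Tags whose name is not in the list pass through untouched, which mirrors how the PCA action fires only where the StartPatternRE matches.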

The second action for blocking a value attribute

Earlier we discussed adding two actions to our rule, A_BlockCCInResponse1 and A_BlockCCInResponse2. Why a second action? Because we don't know whether the developer (or framework) will put the name attribute or the value attribute first in the input tag. So we need a variation on our StartPatternRE. The second action is identical to the first, the only difference being the StartPatternRE.

StartPatternRE=(?-s)<[^>]*?\svalue\s*=\s*(["']{0,1}).*?\1\s+[^>]*?name\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\3[\/\s>]

In our second action, everything else, even the BlockingMask, stays the same.

Blocking Data in the Response block that is not a value attribute

The section above discussed how to block a value attribute in an input tag. Occasionally you might find PCI data echoed in the body of the response. Examples I've seen are sites that echo back both a bank routing number and an account number, sites that display stored credit card numbers for editing, and similar cases. When you have a site that does this, you need to find an example, cut the HTML code around the PCI data, paste it into a text editor, then modify the real PCI data to be something ‘fake’. Then use the online testing tools to develop a RegEx that will block the PCI portion of the data.

Blocking Data in the CUI/SDK library

The CUI/SDK library is JavaScript that runs in the user's browser, records DOM events on the page (it only sees your web application, nothing else), and sends information about each DOM event back to the Tealeaf system. Most customers want to record the keystrokes the user enters, so they can tell whether the page design causes data-entry problems and correct it if so. So we need to make sure that PCI data is NOT recorded and sent back to Tealeaf. Yes, we could write blocking rules to block the data when it is received, but it is much easier to properly configure the CUI/SDK library to exclude certain fields. In the CUI/SDK, we can use wildcards to match the field names.

As of Version 8.7, the file you need to look at is the tealeaf.js file. The location may change in the future, so you should search for the string tlFieldBlock. You will find a nested structure by this name, and one of the structure’s members will be “name”: followed by a pipe-delimited (|) string containing the names of the input fields to block. The list of names should match the list you have in the request and response blocking rules. An example is

tlFieldBlock: [
  {"name": "CreditCard|CardNumber|NewPassword|SecurityAnswer1", "caseinsensitive": true, "exclude": false, "mask": function () { return TeaLeaf.Client.PreserveMask.apply(this, arguments); } }
],

One of the very nice things about the CUI/SDK blocking is that the list of alternative field names is a true regular expression. password will match password, oldpassword, newpassword, passwordchanged, etc. You can use ^ at the start of an alternative, or $ at the end, to anchor the alternative string to the start or end of the field name (see the RegEx help at the online testing tools if you are unfamiliar with start and end anchors).
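A minimal Python sketch of the substring-versus-anchored behavior (the field names are hypothetical):

```python
import re

# Hypothetical input field names on a page.
fields = ["password", "oldpassword", "newpassword", "passwordchanged", "username"]

# Unanchored alternative: substring match catches every variant.
print([f for f in fields if re.search(r'password', f)])
# ['password', 'oldpassword', 'newpassword', 'passwordchanged']

# Anchored at both ends: only the exact field name matches.
print([f for f in fields if re.search(r'^password$', f)])
# ['password']
```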

The best management of this I’ve ever seen: one customer had a fellow whose job included making fixes to the web site. A really sharp guy, he was not a member of the development group but of operations, and his job was to fix the web site when something was broken and send those fixes back to the development team for incorporation into subsequent releases. This person was given the responsibility of adding the tealeaf CUI/SDK libraries to the web pages. As is usually the case, it was done within a page template (and in fact was incorporated as part of a tag/library management solution). But since he could change the web pages, he made sure all PCI input fields had clear names which included password or creditcard or cvv or similar. Then it was very easy for him to make sure the CUI/SDK blocking rule names matched these input field names. I hope that your company makes a similar decision and lets the CUI/SDK implementer change input field names when necessary. It will certainly reduce maintenance costs!

Blocking data in XML Web Services and “one page” designs

Blocking PCI data in an XML web service or in a JSON update to a page’s DOM is no different than blocking data in a classic web application. We are still dealing with Request and Response blocks that make up a hit. During development of the PCI privacy rules, find examples of tealeaf sessions having hits from the web service or one-page JSON DOM update, inspect the request and response blocks with the Replay tool (RTV or Replay server), then write and test RegExs to block the data. If the XML service is answering requests on just a very small number of URLs, you should consider making the Rule look at just those specific URLs, to keep the rule efficient. No use inspecting other pages for patterns that will never exist, nyeh?
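For example, here is a hedged sketch of blocking a card number inside a JSON response body. The field names are invented; the important point is that the replacement keeps the JSON well-formed so replay still works:

```python
import re
import json

# Hypothetical one-page JSON update echoing a stored card (fake number).
response = '{"status":"ok","card":{"number":"4111111111111111","exp":"12/25"}}'

# Block just the card number value with a fixed-length mask; quotes and
# braces are untouched, so the document stays valid JSON.
blocked = re.sub(r'("number"\s*:\s*")\d+(")',
                 lambda m: m.group(1) + 'X' * 16 + m.group(2), response)
print(blocked)
```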

Encrypting data instead of blocking it

Instead of blocking PCI data, the data can be encrypted so that only authorized users are able to see it in clear-text. Access to the clear-text is secured by using specific Active Directory security groups. Details on how to accomplish this are provided in the IBM Tealeaf CX Configuration Manual, in the sub-section Encrypting Data Filter under the section Privacy Session Agent. Please refer to that document for details. A summary of the steps is as follows:

  • Create an AD security group and populate it with authorized users
  • Use TMS to configure the Search Server Authentication
    • Add the AD security group to the Search Server Authentication dialog
    • Create a privacy key and assign it to the AD Security Group. Copy the key to the clipboard
  • Edit the privacy.cfg file
    • Add the privacy key to the bottom of the file in the [Keys] section with a new identifier, e.g. [Key03]
    • In the privacy blocking action for the fields you want to encrypt, change the Action from Block to Encrypt
    • Add the line Key=Key03 to the Action block
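Assembled from the steps above, the resulting privacy.cfg fragment might look roughly like this. This is only a sketch – the section names and key placement are paraphrased from the steps, so verify the exact syntax against the Configuration Manual:

```ini
[Keys]
...
[Key03]
Key=<privacy key copied from the TMS Search Server Authentication dialog>

; in the existing privacy action for the fields to be encrypted:
Action=Encrypt   ; changed from Action=Block
Key=Key03
```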

With this, an authorized user will see the clear-text value of a field during replay. Field values are stored in the Canister using the encrypted value, and indexed using the encrypted value, so searching for a clear-text value will not work.

Maintaining the blocking rules

One of the unfortunate facts of life in a tealeaf administrator’s job is that the web site will change over time. New PCI fields will appear in the application, and the Tealeaf privacy rules will need to be maintained and updated to account for these changes. The hardest part of maintenance is getting notified that PCI-impacting changes are part of a new release. Here’s a partial list of methods I’ve seen used to keep up to speed on this:

  • Attend new feature design and release meetings. Try to make sure somebody from the tealeaf admin team attends meetings where new features are being discussed, and they keep an eye (or ear) out for changes that discuss new credit card features or new password features. If a new PCI impacting feature is discussed, make sure the tealeaf team sees QA/Staging versions of the application, and implements new PCI rules BEFORE the application is released to production.
  • Set up a tealeaf event that looks for 15+ consecutive digits. This event will trigger if a credit card number comes through in clear-text. Of course, depending on the web application, it may also trigger on any other run of 15+ digits, but tealeaf admins for the site should become familiar with the ‘normal’ volume, and keeping an eye on this event may catch new places where CCNums appear.
  • Regularly (weekly or bi-weekly) do a full-text search of the Canister data for the strings password, passwd, passwrd, and even pass?*. Review the matches to see if new password fields have appeared.
  • Publish a regular (monthly) list of PCI-impacting fields you are blocking, and e-mail these to development teams or their management. Reminding these teams on a regular basis exactly what is PCI-impacting can help keep developers, new and experienced both, cognizant of the fact that they need to communicate PCI-impacting changes to the tealeaf team.
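For the digit-run event, a Luhn check can cut down the false positives from other 15+ digit strings. Here is a sketch (the test data is fake):

```python
import re

def luhn_ok(digits):
    """Standard Luhn checksum, used to separate likely card numbers
    from other long digit runs (tracking numbers, ids, etc.)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

text = "order=987 cc=4111111111111111 track=123456789012345"
for run in re.findall(r'\d{15,}', text):
    print(run, luhn_ok(run))
# 4111111111111111 True
# 123456789012345 False
```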

Conclusion

Blocking or encrypting PCI data to prevent it from being available to unauthorized tealeaf users is one of the most important tasks of the tealeaf admin staff. I hope you find this blog post useful for blocking the confidential information in your tealeaf capture. Please post feedback, errata or further questions in the comments. Happy TeaLeaf-ing!

Saturday, March 22, 2014

Tealeaf Ecosystem High-Level Block Diagram and Description

The following is a very high level block diagram and short description of ALL the components that make up a full tealeaf system, including the optional components. I’ve written this post to reflect the way I implement systems; other configurations are possible. For example, PCI blocking can be done at the HBR level or the Processing/Canister level, but doing so means more servers must be made PCI compliant, costing more money and carrying more risk. In another example, I’ve heard that few other implementation consultants bother to implement the Statistics database, but the effort is minimal and the data valuable, so I’ve never understood why not. Since this is a high-level description of the tealeaf components, I’ve chosen to describe them in declarative style without trying to enumerate all the possible options for configuring each component.
ATAP Tealeaf Architecture High-Level Block Diagram

[Edit 06/05/2014: A PDF version of this drawing can be found here.]

Below are descriptions of each block.

Network Components

This block encompasses all of the non-tealeaf servers that are involved with the duplication and transmission of the data packets.

Packet Duplication

Any network component (Tap, Switch Span, Load Balancer) that performs the actual duplication of TCP/IP packets. SSL decryption may be performed at this layer, or lower at the Passive Capture Appliances (PCAs) layer.

Packet Transmission

The network component (direct-connect crossover cable, switch, Gigamon) that connects the duplicated packet stream to the Passive Capture Appliances (PCAs). These devices may also connect the duplicated packet stream to other devices, such as intrusion-detection systems that need the data.

Passive Capture Appliances (PCAs)

Redundant Linux servers connected to the duplicated packet stream that reassemble the packets into request blocks and response blocks, and these into hits.  SSL decryption may be performed here, or higher at the Packet Duplication layer. PCI data blocking and/or masking should all be performed at this layer. The TLTSID sessionization value will be inserted here if not already present in a HTTP cookie from the request or response block. PCA servers should be treated as PCI critical components. Below this layer, no PCI data is present. PCA servers are managed via SSH and/or a tealeaf web console (GUI) that may only be accessed by tealeaf admins from the Health-Based Routers (HBRs).

Health-Based Routers (HBRs)

Redundant Windows servers connected to the PCAs whose primary purpose is to distribute the traffic stream to multiple Processing/Canister servers. Distribution is session-sticky (using the TLTSID cookie), and normally done with a statistical even-distribution algorithm to send roughly the same number of sessions to each Processing server.
Overall, this diagram is for a production tealeaf system, but a good corporate tealeaf implementation includes both development and QA tealeaf systems as well (much smaller of course). Developers of tealeaf events need a small stream of production data sent to the development tealeaf system in order for developers to have data against which to create new events. The HBR servers include the capability to extract specific sessions or a statistical random fraction of sessions from the production stream and send those sessions to the development tealeaf system, as indicated in the diagram.
The HBR servers monitor all of the Processing/Canister servers; if any Processing server stops responding, the HBRs take that server out of rotation and redistribute its traffic to the other servers. If a Processing server comes back alive, the HBR servers begin sending traffic to it again. Processing servers are cycled and self-checked every night at different times, and the HBRs must take each Processing server out of rotation, allow sufficient time for most sessions on that server to end, recognize that the Processing server has stopped responding (while it self-checks), recognize when it comes back on-line, and begin sending it data again.
HBR servers do no data storage, but operate on the data stream. Examples include robot identification (User-Agent or IP based); deletion of hits based on IP address, URL, or any combination of request or response patterns; rewriting the Remote IP address using the latest HTTP_X_FORWARDED_FOR value; copying cookie values like a SID to the appdata section for indexing; condensing the referrer domain to provide meaningful referrers; extracting price and currency information from a page; normalizing page URLs to remove locale country codes; and many other operations.
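The session-sticky, even-distribution idea can be sketched as hashing the TLTSID over the set of currently healthy Processing servers. This is a simplification for illustration only, not the actual HBR algorithm:

```python
import hashlib

def route_hit(tltsid, healthy_servers):
    """Sketch: hash the session id over the currently healthy Processing
    servers, so every hit in a session lands on the same server and
    sessions spread roughly evenly."""
    h = int(hashlib.md5(tltsid.encode()).hexdigest(), 16)
    return healthy_servers[h % len(healthy_servers)]

servers = ["proc01", "proc02", "proc03"]   # hypothetical server names
sid = "0A1B2C3D4E5F60718293A4B5C6D7E8F9"   # a 32-character TLTSID
print(route_hit(sid, servers))
```

Note that a naive modulo hash like this remaps many sessions whenever the server list changes; the real HBR fail-over logic is more involved.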

Processing/Canister servers

Redundant Windows servers connected to the HBRs that provide the storage location for hits and process these hits looking for patterns. The duration that sessions persist for replay and analysis is a function of traffic density and the amount of disk space the Canister servers provide. Sessions are extracted from the canisters on demand to replay in the Replay Server or RealiTea viewer.
Processing of hits is shorthand verbiage for the complex pattern recognition performed by the tealeaf Events system. Pattern recognition is done against individual hits, against session metadata, and by combinatorial logic looking at all events that have occurred in the entire session. In addition to looking for patterns, match groups can be defined to extract substrings from the data, and the number of times each substring occurs can be recorded. Metadata such as the “time-into” a session that a pattern occurs can be extracted. Extracted data can be grouped into sets. Event processing and data extraction performed in the Processing/Canister servers can be very complex.
If the optional cxConnect Real-time data extractor is installed, the Processing servers create messages for each event configured for real-time extraction, along with information on each hit the event occurs on, and these messages are delivered via a tealeaf pipeline to the cxConnect server.

Reporting Server

Non-redundant Windows server connected to the Processing/Canister servers that polls the Processing servers for their traffic and event counts, and provides the primary web-based GUI for users to see reports on this data. Traffic and Event data is collected from the Processing servers and stored in the Tealeaf SQL Reporting database. Dimension aggregates are calculated and stored in the database during each collection run. Daily, Weekly and Monthly aggregates are also calculated periodically. The Reporting GUI provides an interface to create and view reports on this data. Every tealeaf PCA, HBR, and Processing server reports its raw traffic statistics to the Reporting server, which stores this information in the Tealeaf SQL Statistics database, and the information is available in the Reporting GUI.

Tealeaf SQL Database server

Non-redundant Windows server running Microsoft SQL Server, or a corporate server running Microsoft SQL server. Tealeaf creates and uses three SQL databases (System, Reporting, and Statistics). The SQL server is usually managed by the corporate DBA team, and not the tealeaf admins. The tealeaf admin team works closely with the DBAs to install, update, maintain, and backup these databases.

Replay Server

Usually a service running on the Reporting Server, the Replay server may also be configured as a stand-alone Windows server. The Replay server is used to create visual replay of user sessions from the data stream of hits (request/response pairs) stored in the Processing/Canister servers. The Replay is web-based, and does not require any executable to be installed on a tealeaf user’s computer. See also the RealiTea Viewer section for an alternative method of replay.

cxOverstat

An optional feature that, when turned on, allows the Replay server to display heatmaps and the other visualizations provided by cxOverstat.

Archive Servers

Optional Windows server(s) whose purpose is to extract a subset of sessions from all Canister servers that meet a selection criterion (most often, a purchase session or a trade session), and store these sessions for a duration much greater than the Canister servers do. For example, if the Canister servers are all storing sessions for 30 days, the Archive servers may store purchase sessions for two years. These can also be configured as “non-tamperable”, which provides a hash-based mechanism to prove that a session replayed from the Archive server is the same session that was originally captured from the web site.

TLI servers

Optional Windows server(s) that store and make available for replay certain static content of the web site. Ordinarily during replay, the images and JS files are loaded from the live web site. With a TLI server, these files are stored each day, so during replay of a session from, say, three weeks ago, the images and JS files as they existed three weeks ago are used in the replay. This provides replay with better fidelity to the actual historical session. The drawback is the amount of storage needed to keep historical static content.

cxConnect Servers

Optional Windows server(s) whose purpose is to provide the interface that extracts data from the tealeaf Processing/Canister servers and makes that data available to external systems. There are two separate data feeds available. The real-time data feed is a set of (configurable) selected event messages with parameters that are sent by the Processing/Canister servers to the cxConnect servers using a tealeaf transport pipeline. The cxConnect servers have two distribution choices for this data – log to file and/or send to a TCP/IP listener on an external system. The real-time feed is most often used to feed a Complex Event Processing (CEP) listener system. In turn, these systems drive real-time decisioning systems that modify the web application’s responses to the user based on their past actions in the session. The other data feed available from the cxConnect systems is a scheduled (typically hourly or daily) batch extract of detailed information regarding users, sessions, hits, parameters, events, and dimensions. The information is stored in flat files that conform to the Microsoft SQL Bulk Load (BCP) format. tealeaf provides an example schema for creating a relational SQL database, and script jobs for loading the corresponding BCP files into these tables. This is the reference mechanism provided by tealeaf for putting the data into a fully relational database.
The following three pieces of tealeaf code are implemented on the web sites and native mobile applications

Cookie Injector

Very small piece of code running on the web servers that adds the three tealeaf cookies. The TLTSID is a non-persistent cookie whose value does not change as long as the browser window remains open. The TLTHID cookie is a unique identifier assigned to each hit. The TLTUID is a persistent cookie left on the browser, whose value is sent each time the user revisits the site. All three cookies are 32-character GUIDs.
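As a sketch, each cookie value is just a 32-character GUID; the exact formatting and generation method the real injector uses are not something I’m asserting:

```python
import uuid

# One 32-character GUID per cookie; names match the description above.
tltsid = uuid.uuid4().hex.upper()  # session id: constant while the browser window stays open
tlthid = uuid.uuid4().hex.upper()  # hit id: unique per request/response pair
tltuid = uuid.uuid4().hex.upper()  # user id: persistent across visits
print(tltsid, tlthid, tltuid)
```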

Client-Side Recording (CUI/SDK) Library

JavaScript (JS) library added to the web site and called from the web pages. This library implements the recording of the DOM events on the web page, and transmits the page’s DOM event information (mouse actions, keyboard actions, page rendering time, etc.) back to the tealeaf system. Includes a target page to be added to the web site that the library will call back to. The CUI/SDK is a very important piece to implement properly for sites that use anything similar to the “one-page” technique. The implementation of the CUI/SDK provides a much higher fidelity replay of the session.

Mobile Library

JavaScript (JS) library added to the mobile web site and to the mobile device native application (iOS or Android), called by the device. This library implements the recording of the user’s interactions with the device and the page. Screen click, key-press, swipe, pinch, rotation and other page interactions are recorded and transmitted back to the tealeaf system. Proper implementation of the mobile library provides a much higher fidelity replay of the session.
The following executable is installed onto the tealeaf user’s computer for replay of sessions

RealiTea Viewer

This is an executable program that can be installed onto a tealeaf user’s computer to allow for replay of user sessions. In addition to just replay, it allows multiple sessions to be downloaded to the user’s computer, and provides searching and analysis capabilities for patterns across these downloaded sessions. It includes the ability to customize the view panes and data fields displayed for both hits and for session metadata.
The following systems are not part of the tealeaf ecosystem, but if they exist in the company, they may be fed with data from the optional cxConnect servers.

SQL Relational Database Server for tealeaf events

Non-redundant Windows server running Microsoft SQL Server, or a corporate server running Microsoft SQL server. Tealeaf provides a reference schema for a relational database that links together sessions, hits, query parameters, events, and dimensions. The cxConnect data extractor populates these tables. These tables provide a very rich source for analytics against the user behaviors.

Real-Time decisioning systems

A system that modifies the web page contents based on the user’s past actions. Usually some kind of Complex Event Processing system tied into the web servers.
The following software constructs are resident in the corporate Active Directory structures.

Active Directory Security groups for tealeaf

At a minimum there are two AD security groups. One enumerates the userids which are allowed to use the tealeaf reporting GUI and to access the session data stored in the Processing/Canister servers for replay. The other enumerates the userids given access to tealeaf at an administrative level. These may be Global groups or Domain-local, meaning that userids from multiple forests are supported if desired. Additional AD security groups may be created for teams such as fraud investigation teams, which are given access to encrypted PCI data, should the system be configured to encrypt certain fields instead of blocking them. Only members of these specific AD groups will be able to see the PCI data in clear text. Normal tealeaf users and tealeaf administrators see encrypted gibberish. PCI fields that are blocked instead of encrypted are always replaced with ‘X’ for all users.

Conclusion

I hope this overview is useful in understanding at a very high level all the components of the tealeaf ecosystem. As with all large complex computer systems, the features and components are evolving, so this post will eventually become obsolete. However, as of tealeaf version 8.8 in the spring of 2014, this should be a pretty complete picture.
Feedback, comments, and questions are always welcome. Happy Tealeaf-ing!

Wednesday, February 5, 2014

Tealeaf Dimensions

Let’s explore some usual and unusual ways to use dimensions in tealeaf reports. I’ll assume you know the basics of what dimensions are, and the standard way of using them. To review quickly, an event’s value can be stored in a dimension. When you look at a report for that event over an hour’s period, you can see how many times the event fired (the event’s count) in that hour, or you can look at how many times each value occurred in that hour. Of course you can look at more than an hour – but let’s keep the verbiage and phrasing simple. When you set up the dimension, you can tell the dimension to record every value the event ever has, or you can create a whitelist – a list of values that you want to see counted. There are blacklists too, but even after 14 years, I’ve very seldom found a use for those.

Now let’s digress just a bit, and explore what dimensions and whitelists do not do. A value in a whitelist is not an event condition. You cannot say “trigger this event only if a whitelist value is present”. Nope. If you have three values in the whitelist, and the event fires 1000 times, but none of the event’s values are in the whitelist, then the report will show the event occurred 1000 times, and the report will list ‘other’ as the sole dimension value. So you see, the whitelist does not prevent the event from firing – it just identifies the specific values you want to see details about.

Another thing dimensions cannot do (at least as of release 8.7x) is trigger an alert. Often we want to trigger an alert if a specific value of an event occurs, or if an event’s value matches one of a list of such values. Alas, it just doesn’t work that way. An alert gets triggered based on the event’s count, not its whitelist values’ counts. So, you cannot use a whitelist to trigger an alert. Hope this tidbit saves somebody a wasted day trying to make that work.

As an aside – if you need to create an alert for specific event values: 1) create a building-block event, setting this event’s value to whatever you are looking for; 2) create a second event that triggers only when the first event fires. Put your list here, and compare the list to the first event’s value. If the first event’s value matches the list, the second event will fire. Use the second event to drive the alert. The alert is therefore based on the second event, and the second event only fires if the value you are looking for matches the list. If your conditions are complex (an IP in the range of blah to blah; OR a userid of ‘johnDoe’; OR a useragent string containing ‘Snap!’), this technique makes it pretty easy. Create individual Building Block events whose values are IP, LoginId, and useragent, then a single event that looks at each Building Block event, has the specific list of interesting values for each, and combines the results of the list comparison according to your needs to determine if the event fires. We do this for a ‘Session of Interest’ Alert, which goes to security teams (in real time) whenever a specific set of conditions occurs. Then they can go watch that session. If enough folks leave a comment expressing an interest, I’ll post example code for the list matching.

Back to the main thread <grin>

Dimension value counts have to be recorded, hence events with dimensions contribute to the size of the database aggregation tables. If you look at the Tealeaf DB statistics page in the Tealeaf portal’s administration view, you’ll see lots of tables ending in _AGG. If any get over about 40M rows, you are going to see performance degradation in the Data Collector’s speed, in report generation, and surprisingly, in the time it takes the event editor to commit new event definitions. Get too many events with large numbers of unique dimension values, and you’ll be unpleasantly surprised to find it takes fifteen minutes to “save” a new event.

But dimensions can be put to great use with a little care. Anything with a small number of unique values is a perfect candidate. If you are interested in the viewing choices your customers make for a particular product, and want to see their choices broken down by product size or color, use a dimension for size and another for color. Fire the event on a specific product SKU, record the customer’s size and color choice in a report group associated with the event, and you can see a report of how often the product was viewed in each size and color combination. Have a number of similar products? Create the event to fire on any of a list of product SKUs, and add the product SKU as a dimension in the report group. Now your report will show multiple SKUs, sizes, and colors. Just don’t go overboard – recording too many different SKUs in this one event will lead to aggregation table bloat, again.

Another fun thing I like to do with Dimension Groups is to create histogram reports. You know dimensions and whitelists let you record unique values, and dimension groups let you group these values. One good example uses the Session Duration session variable. On each hit, the system updates this session variable with the current duration of the session. You can create an event that records this. Let’s say you create an event on the “purchase complete” page, and record the value of the Session Attribute “Session Length Running Time (sec)”. Now you have, for each purchaser, how long it took them to get from start to purchase. Now let’s group these counts into buckets. Create a Dimension Group, where the elements are 0-1799;1800-3599;3600-5399;… etc. Populate this Dimension group with the Session Attribute “Session Length Running Time (sec)”. The internal name of this attribute is S_TOTAL_TIME. Assign this Dimension group to a Report group and the Report Group to the event. The resulting report shows 1) how many purchases in the hour; 2) how many customers made their purchase in 0-29.9 minutes; 30-59.9 minutes; 60-89.9 minutes; and so on. I find this technique interesting because it helps the business visualize how long it takes most customers to make a purchase. You can extend it so the purchase event triggers for a specific product, and then compare the two reports to see if different products take differing times to purchase. I wish there were some way to put the comparison on a single report, but I haven’t found a way to do that.
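The bucketing the Dimension Group performs can be sketched in a few lines (the purchase times are invented):

```python
from collections import Counter

# Sketch of what the Dimension Group buckets do: group S_TOTAL_TIME values
# (seconds from session start to purchase) into 0-1799;1800-3599;... ranges.
def bucket(seconds, width=1800):
    lo = (seconds // width) * width
    return f"{lo}-{lo + width - 1}"

purchase_times = [600, 1500, 2100, 4000]  # hypothetical S_TOTAL_TIME values
histogram = Counter(bucket(t) for t in purchase_times)
print(histogram)
```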

One more I have not tried personally, but it would be interesting… The Report Group above measures how long it takes to reach a purchase from the start of the session. What if you want to know how long it takes to get from Event A to Event B, and report that as a histogram? For example, start of checkout to purchase complete? Create Events A and B for Start of Checkout-BB and End of Checkout-BB, and record the Session Length Running Time (sec) as their values. Create an Event C “Time in Funnel (sec)” that fires when both events have fired. Edit it in Advanced mode, and record the value of this event as the value of Event B minus the value of Event A. Note that you may need to account for multiple Event A’s occurring before the Event B, so make sure you use the latest (highest numbered) Event A in the Event A collection. Now use the value of Event C to populate a Dimension Group along the lines of the last example, and use the Dimension Group to populate a Report Group assigned to Event C. Report on Event C, and you get a report showing how many users spent how many seconds in the checkout funnel. I have not tried this one yet – please let me know if you encounter any difficulties.
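The Event C arithmetic, sketched with hypothetical S_TOTAL_TIME values:

```python
# Sketch of the Event C calculation with hypothetical S_TOTAL_TIME values.
event_a_values = [42, 130]  # every "Start of Checkout-BB" firing in the session
event_b_value = 610         # "End of Checkout-BB" at purchase complete

# Use the latest (last) Event A, since several may fire before Event B.
time_in_funnel = event_b_value - event_a_values[-1]
print(time_in_funnel)  # 480 seconds spent in the checkout funnel
```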

I hope these work for you, and spark some more interesting ideas on using dimensions!

Friday, October 19, 2012

The ErrorCode and ErrorMessage Events

A really useful group of events is the ones that can detect the error messages displayed on your site. But these can be difficult to get right. Let’s go through creating events to detect error messages, and some pitfalls to watch out for.

What’s the difference between an error message and an error code? On some sites, no difference at all – all you have to work with is the message string the user sees. But if you have a site that supports multiple languages, you may have the same error presented in two or more language-specific strings. In that situation, you are going to want to know not just the specific string being shown, but how often each error condition displayed regardless of the language. For this, you will need the site developers to insert an error code (it can be “non-displayable” in the page) as well as the error message.

Here’s an example: class="fError">! There are no ...<!--ErrCode:FA65--></

After you’ve isolated the error string from its surroundings, breaking that string into the language-specific error message and the error code requires only two simple regular expressions:

$REMessage = /(.*?)<!--/;
$RECode = /<!--ErrCode:(.*?)-/;
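Translated to Python against the isolated error string from the example above (note the message pattern must match the full `<!--` comment opener):

```python
import re

err = '! There are no ...<!--ErrCode:FA65-->'  # already isolated from its surroundings

message = re.search(r'(.*?)<!--', err).group(1)
code = re.search(r'<!--ErrCode:(.*?)-', err).group(1)
print(message)  # ! There are no ...
print(code)     # FA65
```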

Getting the error string separated from its surroundings, though, that may be tricky…

First, Last, Count, and Aggregate Field Errors

If the site has only one error message on a page, a simple hit attribute and pattern is all you need, but in practice it is seldom that simple. Most often, the site may have multiple error messages on a single page. In order to understand which error messages are most common, we need to extract every error message. Hit attributes combined with Basic mode events can record the first error, the last error, or the number of errors on a single page, but there is no provision in tealeaf for an event to record a collection of errors. The best we can do is create an event that will aggregate all of the errors into one string. We are going to need an Advanced mode event to do the aggregation.
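A sketch of the aggregation idea in Python, collecting every error code on the page into one delimited string (the markup is borrowed from the examples in the next section):

```python
import re

# Two error messages on one hypothetical page.
response = ('class="fError">! There are no ...<!--ErrCode:FA65--></div>'
            'class="fError">! Please enter a valid...<!--ErrCode:A21--><BR></span>')

# Collect every error code, then join them into the single string an
# event value can hold.
errors = re.findall(r'<!--ErrCode:(.*?)-->', response)
print(";".join(errors))  # FA65;A21
```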

Visible and Invisible Errors

The event to aggregate the errors would be much easier if all error messages were visible, but alas that is not always so. In many sites, the developers will deliver all the error messages as part of the HTML of the page, but will use a CSS display attribute to control the visibility of the message. So we need to further refine our event so it only records Visible errors. Here are examples that show the same error message in multiple visible and invisible forms.

Visible:

class="fError">! There are no ...<!--ErrCode:FA65--></div>
class="fError"><br />! Make a selection for ...<!--ErrCode:TI1--></span>
class="fError">! Please enter a valid...<!--ErrCode:A21--><BR></span>
class="fError">! Please select a proper option.</span></p> – Note that the developers left out the error code!
class="fError" style="color:#CC0000;">! Please enter a New ...<!--ErrCode:V1--><br/></span>

Invisible:

_CustomValidator1" class="fError" style="color:#CC0000;display:none;"></span>
<p class="fError" visible="False"></p> – Note that this doesn't even have the trailing </span>
class="fError">< – This is an “empty” error message


Ignored: (our business users have told us they don’t want to include this error when analyzing messages)
class="fError" style="font-size:80%">! Your membership is expired.</div><

Regular Expression Patterns

The following regular expressions (REs) will capture each type of field error. There is some overlap: patterns that are both invisible and empty will fire both of their respective REs. Instead of writing out what each pattern is doing, let me send you to an online tool that will do a great job of explaining them – just paste the RE into the tool, check ‘Explain’, and hit Submit. I’ve looked at dozens of RE test tools; this is my personal favorite, and the one I use when developing Advanced mode events: http://myregextester.com/index.php

$REVisible = /class="fError(?:" style="color:#CC0000;){0,1}">(.+)<\/(?:span|div)/;
$REInvisible = /(?:class="fError" style="color:#CC0000;display:none;"|class="fError" visible="False">)/;
$REEmpty = /class="fError"><\/(?:span|div)/;
$REIgnored = /class="fError" style="font-size:80%">/;
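As a sanity check, here is a sketch that runs three of these patterns against the sample snippets above (with the class name normalized to fError, and the stray double pipe removed from the empty pattern):

```javascript
// Classify a snippet as invisible, empty, visible, or unknown.
var REVisible = /class="fError(?:" style="color:#CC0000;){0,1}">(.+)<\/(?:span|div)/;
var REInvisible = /(?:class="fError" style="color:#CC0000;display:none;"|class="fError" visible="False">)/;
var REEmpty = /class="fError"><\/(?:span|div)/;

function classify(snippet) {
  if (REInvisible.test(snippet)) return "invisible";
  if (REEmpty.test(snippet)) return "empty";
  if (REVisible.test(snippet)) return "visible";
  return "unknown";
}
```

For example, `classify('class="fError">! There are no ...<!--ErrCode:FA65--></div>')` returns `"visible"`.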

Known Error Patterns

Grouping together the visible, invisible, empty, and ignored patterns into one pattern, “Known”:

$REknown = /class="fError"(?: style="color:#CC0000;display:none;">| style="color:#CC0000;">| visible="False">| style="font-size:80%">|><\/span|>.+<\/span|>.+<\/div)/;

Event Coverage

Because of the complexity, we’ll also want some code to check our event coverage: are we counting every visible error and ignoring every invisible error? Are there any errors that don’t fit our definition of visible, invisible, or ignored?

We can do this on every page with the following events:

  • [BB] count every occurrence of class="fError" (and the alternate formulations: single-quote characters, no quote characters at all, and zero or more white-space characters around the equals sign):
    • /class\s*=\s*(['"]?)fError\1\s*/ig
  • [BB] count every occurrence of Known Field Errors
  • [E] count every occurrence of unknown field errors on a page (total – known)
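The coverage arithmetic can be sketched like this (REKnown here is a deliberately simplified stand-in for the full “Known” pattern above):

```javascript
// Coverage check: every fError occurrence vs. those our known patterns account for.
var RETotal = /class\s*=\s*(['"]?)fError\1\s*/ig;                 // all fError class attributes
var REKnown = /class="fError"(?: style="[^"]*"| visible="False")?>/ig; // simplified "known" stand-in

function countMatches(re, str) {
  var m = str.match(re);
  return m ? m.length : 0;
}

// Unknown errors = total occurrences minus the ones the known patterns cover
function unknownErrors(body) {
  return countMatches(RETotal, body) - countMatches(REKnown, body);
}
```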

 

Events and Dimensions for Error Codes and Messages.

Finally! We get to the actual events, dimensions, and report groups. In the interest of brevity, I’ll present the Aggregate Visible Field Error Code information. If you want the details on First Visible Field Error and Last Visible Field Error on a page, you can copy and modify these. I will show you the “guts” of the JavaScript in the Advanced mode event for each of these as well.

All of these events depend on the triggering event G:Err:NumberOfVisibleFieldErrorCode:E. When that event appears on a hit, each of the following events is evaluated. In this event, we set the conditions “status code = 200, URL does not include tealeaftarget”, and whatever other conditions you may want to exclude (for example, excluding a specific domain).

In order to search repeatedly through a response, we need a hit attribute (HI) that will extract the response: G:ResponseBody, with start tag <html and end tag </html. In each of the following events, we use the HI to extract the entire response body into a local variable. When regular expressions are applied repeatedly to a string, there are internal pointers that remember how much of the string has been parsed. If we called the HI function for the response body in each loop iteration, those pointers would get reset. So we make a local copy, and loop the regular expression against the local copy.
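This is the familiar exec()/lastIndex idiom: a JavaScript regex created with the /g flag keeps a lastIndex pointer between exec() calls, which is why the loop runs against one stable local string. A standalone illustration:

```javascript
// A global regex advances its lastIndex pointer between exec() calls;
// the loop below relies on that to walk every match in the string.
var re = /ErrCode:(\w+)/g;
var body = 'x<!--ErrCode:FA65-->y<!--ErrCode:TI1-->z<!--ErrCode:A21-->';

var codes = [];
var m;
while ((m = re.exec(body)) !== null) {
  codes.push(m[1]); // each call resumes from re.lastIndex
}
// codes: ["FA65", "TI1", "A21"]
```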

Events:

G:Err:AggregateVisibleFieldErrorCodeOnPage:BB [ADV] – a string consisting of all the visible field-level error codes on a page. Iterate over the response body with the REVisible pattern. Each time a match occurs, extract the error code from the match string. Accumulate the multiple error codes with ‘:::’ as a separator. Store this string as the value of the event.

function YOURNAMESPACE$E__G_ERR_AGGREGATEVISIBLEFIELDERRORCODEONPAGE_E__634853855183542782()
{
    // Only evaluate on the hit where the triggering event fired
    if ($F.getLastFact("YOURNAMESPACE.F_E__G_ERR_NUMBEROFVISIBLEFIELDERRORCODEONPAGE_E__634761380708236960").HitNumber == $H.HitNumber)
    {
        $REVisible = /class="fError(?:" style="color:#CC0000;){0,1}">(.+)<\/(span|div)/ig;
        // Local copy of the response body, so the regex pointers are preserved
        var $str = $P["YOURNAMESPACE.P__G_RESPONSEBODY__634753590986008896"].firstValue();
        var $cnt = 0;
        var $resultstr = "";
        while ($matches = $REVisible.exec($str)) {
            if ($matches != null && $matches[1] != null) {
                // Pull the error code out of the matched error element
                $RECode = /<!--ErrCode:(.*?)-/;
                var $RECodeMatch = $RECode.exec($matches[1]);
                if ($RECodeMatch != null && $RECodeMatch[1] != null) {
                    $cnt++;
                    $resultstr += $RECodeMatch[1];
                    $resultstr += ":::";   // ':::' separates the aggregated codes
                }
            }
        }
        if ($cnt > 0) {
            // Set fact for Report Group: No Dimension Report Group
            $F.setFact("YOURNAMESPACE.F_E__G_ERR_AGGREGATEVISIBLEFIELDERRORCODEONPAGE_E__634853855183542782", $resultstr);
        }
    }
}

G:Err:LastVisibleFieldErrorCodeOnPage:BB – Iterate over the response body with the REVisible pattern. Each time a match occurs, do nothing until the last match. Then split the match string into message and code. Store the code string as the value of the event.


G:Err:FirstVisibleFieldErrorCodeOnPage:BB – Test the REVisible pattern against the response body. Take just the first match. Split the match string into message and code. Store the code string as the value of the event.

$REVisible = /class="fError(?:" style="color:#CC0000;){0,1}">(.+)<\/(span|div)/ig;
var $matches = $REVisible.exec($P["YOURNAMESPACE.P__G_RESPONSEBODY__634753590986008896"].firstValue());
if ($matches != null && $matches[1] != null) {
    // Pull the error code out of the first matched error element
    $RECode = /<!--ErrCode:(.*?)-/;
    var $RECodeMatch = $RECode.exec($matches[1]);
    if ($RECodeMatch != null && $RECodeMatch[1] != null) {
        // Set fact for Report Group: No Dimension Report Group
        $F.setFact("YOURNAMESPACE.F_E__G_ERR_FIRSTVISIBLEFIELDERRORCODEONPAGE_BB__634761678018624416", $RECodeMatch[1]);
    }
}



G:Err:AnyVisibleFieldErrorOnPage:E is the reportable event we look at to see what errors are happening on a page. This event is a simple “OR” of the six BB events, so that if any of the first/last/aggregate Visible Error Code/Message events fire, the Report Groups on this event will record the six values and the URL (page) on which they occurred.


Note that recording all six dimension values may be overkill for your analytics group. I like to record all six for a few days, and review them with the business information consumers. If they agree that the aggregate dimensions are all they need, it’s easy to remove the unneeded dimensions. You will also want to be wary of the Message dimensions. While these are certainly more consumer-friendly than the error codes, storage for long strings of aggregated error messages may prove to be a concern. If you decide to record this information, keep an eye on the aggregate table usage over a few weeks, and plan to re-evaluate its usefulness and its storage cost.


The Results


After putting these together, we can analyze the errors that happen on a specific page. We do this by looking at the event G:Err:AnyVisibleFieldErrorOnPage:E, homing in on a specific URL or group of URLs, and looking at the error dimensions. When this event fires, we are recording (and viewing) the errors that appear on that page.


On what pages do field errors occur the most often?


image


On any specific URL, what are the most common visible field errors (aggregate of all errors on the page)?


image


What are the most common “last seen” error codes when a user abandons the checkout process?





The truly actionable information you can glean is the “last seen error when a purchase is abandoned”. This is the favorite information for a lot of data consumers. Understanding it helps you understand what it is about the site that most often impedes the purchase process.


This works because we defined our dimension to record the “latest” value of a visible error event. We define an event that fires at the end of a session if we detect that abandonment took place (e.g. FPR:Abandoned:S, which is defined as FPR:ReviewRevenueR:S AND NOT FPR:ConfirmRevenueV:S), and we assign our six error dimensions to report groups assigned to this event. We don’t need the URL as part of this report group, because as an end-of-session event, the URL dimension would haphazardly record whatever URL happened to be the last one in the session (most often the tealeaftarget hit that occurs when the UI SDK phones home the information for the last page). However, the “latest” error messages may not be the error messages on the last page of the abandoned checkout process, if the user visits other parts of the site, like searches, after their last checkout process page.


So this event is the closest we can get using the built-in Tealeaf reporting tools. To get closer to the truth, a better approach would be to use cxConnect to record the error event in an external BI data warehouse, and use analysis queries to ask “for every abandoned session, which was the last checkout process page seen, and what were the errors shown?”.


Within the limits of the Tealeaf reports, here are the “last seen visible error codes when a purchase is abandoned”.


image



We’ve filtered to eliminate the [Null] value. If a checkout process is abandoned, and there are no visible field errors in the session, then [Null] is recorded for this dimension. On this specific site, many automated tools get to the first page of the checkout process, producing a [Null] entry here when the automaton “abandons” its checkout. An external BI data warehouse would allow further refining these numbers to eliminate robots.


Good luck error hunting! I hope this piece will help you better understand some of the event and dimension concepts in Tealeaf V8.

V8 Process Event Guidelines

Every e-commerce web site has one or more “processes” – sequential steps the user has to go through to accomplish a goal. The most important process for most e-businesses is the “checkout” process (aka the “purchase” process). In this post, we will discuss how to create events that track processes, and the characteristics you should build into these events.

At the fundamental level, a client sends a request, and the server sends back a response (view). You should create two process events for each step: one to recognize and count the request, and another to recognize the response. When Tealeaf recognizes a request event, you know the user actively asked for the next step of the process; when Tealeaf recognizes the response, you know the system presented the information to the user. In my post V8 Event Naming Conventions I discuss how to create event names that reflect this difference. The request event can be based on the URL of the page, and the view event is best based on a unique pattern present in the response when the page is successfully rendered.

However, before we go much further: there are major differences between Apache/IIS/ColdFusion server technologies when it comes to how each technology uses the URL, and this makes a huge difference in how the request for a step is recognized. <Insert more on this>

Now that we are all using the same event names to mean the same logical view of the steps regardless of our site’s technology, onward!

Conceptually, a simple checkout process is “Cart View”, “Billing Info”, “Final Review”, and “Confirmation”. With a request and a view event for each, this produces an eight-step checkout process.

Process Step Events for Every Step

The eight events that measure every step occurrence are P:CartViewR:E, P:CartViewV:E, P:BillingInfoR:E, P:BillingInfoV:E, P:FinalReviewR:E, P:FinalReviewV:E, P:ConfirmationR:E, and P:ConfirmationV:E. When we create the events, base the R events on the URL of the page, and the V events on the page title or some other string in the response that positively and uniquely identifies the page. Set the event type as “count”, evaluate on every hit, and do not add any report group yet.

These events tell us how many times each process step occurs, even if they occur multiple times in a session.
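As an illustration (the URL and page title here are hypothetical, not taken from any real site), an R/V pair for one step boils down to two patterns like these:

```javascript
// Sketch (hypothetical URL and page title): the R event keys on the URL,
// the V event keys on a unique string in the successfully rendered response.
var REBillingInfoR = /\/checkout\/billinginfo/i;
var REBillingInfoV = /<title>\s*Billing Information\s*<\/title>/i;

function firesBillingInfoR(url) { return REBillingInfoR.test(url); }
function firesBillingInfoV(body) { return REBillingInfoV.test(body); }
```

Basing the V event on the response, rather than the URL alone, is what tells you the page actually rendered, not merely that it was requested.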

Session Events for Each Step of the Process

The eight session events that measure “the session had one or more occurrences of a process step” are named P:CartViewR:S, P:CartViewV:S, P:BillingInfoR:S, P:BillingInfoV:S, P:FinalReviewR:S, P:FinalReviewV:S, P:ConfirmationR:S, and P:ConfirmationV:S. Create each of these based on the presence of the corresponding :E event in the session. Set the event type as “count”, evaluate at the end of the session, and do not add any report group yet.

These events tell us how many visitors got to each step of the checkout process during their sessions.

Pyramiding the Session Events

Now string the process steps together by creating the following events. It’s easy to see why they are called pyramiding events. Event type is count, evaluate at the end of the session, and no report group. These tell us how many visitors saw each step without missing a step. They are the most accurate source of information for conversion ratios, and will also be used for error detection, as shown later.

P:Step1R:S - P:CartViewR:S

P:Step1R&1V:S - P:CartViewR:S AND P:CartViewV:S

P:Step1R&1V&2R:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S

P:Step1R&1V&2R&2V:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S AND P:BillingInfoV:S

P:Step1R&1V&2R&2V&3R:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S AND P:BillingInfoV:S AND P:FinalReviewR:S

P:Step1R&1V&2R&2V&3R&3V:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S AND P:BillingInfoV:S AND P:FinalReviewR:S AND P:FinalReviewV:S

P:Step1R&1V&2R&2V&3R&3V&4R:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S AND P:BillingInfoV:S AND P:FinalReviewR:S AND P:FinalReviewV:S AND P:ConfirmationR:S

P:Step1R&1V&2R&2V&3R&3V&4R&4V:S - P:CartViewR:S AND P:CartViewV:S AND P:BillingInfoR:S AND P:BillingInfoV:S AND P:FinalReviewR:S AND P:FinalReviewV:S AND P:ConfirmationR:S AND  P:ConfirmationV:S
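Logically, the pyramid is a running AND over the eight session flags in step order; a later pyramid event can only be true if every earlier step fired. A quick sketch:

```javascript
// The pyramid as a running AND over the eight session flags,
// given in step order: 1R, 1V, 2R, 2V, 3R, 3V, 4R, 4V.
function pyramid(flags) {
  var result = [];
  var soFar = true;
  for (var i = 0; i < flags.length; i++) {
    soFar = soFar && flags[i]; // a step only counts if every earlier step fired
    result.push(soFar);
  }
  return result;
}
```

So a visitor whose session is missing the Billing Info view (step 2V) is counted in the first three pyramid events only, no matter what later steps they reached.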

Simple Count and Conversion Ratio Reports:

It is straightforward to create reports that show the counts for each event, and the ratio between any two of them. You can generate step-step ratios, or ratios from first step to any subsequent step, including confirmation.
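Given the pyramid event counts in step order, the two kinds of ratios can be computed like this:

```javascript
// Conversion ratios from an ordered list of pyramid event counts.
function ratios(counts) {
  var out = [];
  for (var i = 1; i < counts.length; i++) {
    out.push({
      stepToStep: counts[i] / counts[i - 1], // ratio to the previous step
      fromFirst: counts[i] / counts[0]       // ratio to the first step
    });
  }
  return out;
}
```

The fromFirst ratio of the final pyramid event is the overall funnel conversion.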

In the chart examples throughout this post, the site had a seven-step process, so the pictures do not “quite” match up to the events described in the post. I wanted to use a very short hypothetical process for the text, to keep the post focused on core concepts. I trust you will not have any problem making the correlation and bridging the discrepancies between the charts and the text.

Pyramid Event Counts

image

Pyramid Event Ratios

 

Dimensions and Report Groups

How can we “slice” these numbers to get focused information? By using dimensions and report groups! Let’s examine the possibilities.

Things that are unchanged throughout the session (on the site I’m using for this post) include the Akamai Country Code and the session’s First Hit Referrer Domain.

Things that may change at any time in the session include the Point Of Sale (POS), which the user can control (it changes the language) and is found in each response; the Currency of the purchase, which is based on the country in which the credit card was issued and is always present in the CartView response; and the AmountList, a grouping or “bucketing” of the cart’s sale amount. Another item that changes (but usually only once) is the customer’s loyalty program level, if they sign in.

Now, if we want to know the event counts for each country as reported by Akamai, we add a report group to each event consisting of a dimension that is populated by an event that fires on the first hit of the session and whose value is the Akamai Country Code in an HTTP header. Then in the report builder, we can drag the FirstSeenAkamaiCountryCode:E-F dimension to the chart of event counts, and filter on the top five. This groups the events by the country code that was seen on the first request of the session, and tells us which country has the highest number of visits to each step. Do this on the chart of conversion ratios, and it tells us which country has the highest (or lowest) conversion ratios.

image

Now, if we want to know the event counts for referring domain, we add a report group to each event that consists of a dimension that is populated by an event that fires on the first hit of the session and whose value is the value of the domain portion of the HTTP_REFERRER. Then in the report builder, we can drag the DomainSessionReferrer:E-F dimension to the chart of event counts, and filter on the top five. This groups the events by the domain of the referrer that was seen on the first request of the session, and tells us which referrer has the highest number of visits to each step. Do this on the chart of conversion ratios, and it tells us which referrer has the highest (or lowest) conversion ratios.

image

If we want to know the event counts by POS, we add a report group to each event that consists of a dimension that is populated by an event that extracts (from every hit) the POS from the response, and in the report builder we can drag the POS:E-L dimension to the chart, and filter on the top five. This groups the step events by the POS that was last seen in the session. It is important to note that the POS can change after the last step of the process. The pyramid events are session events, and are evaluated at the end of the session. The value of POS stored in the report group for the pyramid events will be whatever POS was last seen in the last hit of the session.

image

Two-Dimensional Report Group

When the site is multi-currency, you cannot add the amounts of orders together. Adding yen to yuan to reals to pesos to dollars makes no sense. It is important to consider the amount and the currency together as a two-part object. We can accomplish this with a report group having two dimensions, Currency:E-L and AmountList:E-L. Populating these dimensions can be done with an Advanced mode event. The details are beyond the scope of this already-too-long post, but can be found here. <insert>
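To make the bucketing idea concrete, here is a hypothetical sketch (the bucket boundaries are made up; on a real site, the population happens inside an Advanced mode event):

```javascript
// Sketch (hypothetical bucket boundaries): "bucket" a sale amount and pair
// it with its currency so the two are always reported together.
var BUCKETS = [0, 50, 100, 250, 500, 1000]; // assumed bucket edges

function amountList(amount) {
  for (var i = BUCKETS.length - 1; i >= 0; i--) {
    if (amount >= BUCKETS[i]) {
      var upper = BUCKETS[i + 1];
      return upper ? BUCKETS[i] + '-' + upper : BUCKETS[i] + '+';
    }
  }
  return 'unknown';
}

function currencyAmountPair(currency, amount) {
  return { currency: currency, amountList: amountList(amount) };
}
```

Pairing the two values is what keeps a ¥75 order from ever being summed with a $75 one.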

Now in the report builder we can drag the Currency:E-L and the AmountList:E-L dimensions to the chart. We lose the graphical representation, but the tabular data shows which currency and amount buckets are the best performing. (I have no idea why there are [Null] entries – looks like I’ve got some troubleshooting to do :-)

image

Multi-dimension Report Groups

For fine-grained analysis, create report groups of multiple dimensions (up to four dimensions). For example, a report group consisting of the dimensions POS:E-L/LoyaltyStatus:E-L/DomainSessionReferrer:E-F lets us investigate the relationships between these three constraints on the purchase process events.

<insert>

Other dimensions you may want to consider adding are TrafficType:E-F, BrowserType:E-F, BrowserVersion:E-F, and SessionDurationList:H.

Make sure you look at the TrafficType:E-F dimension, or use searching, to see if there are robots getting to the first part of your process.

What is “abandonment”?

Abandonment is simply any session that has the first step of the checkout process and not the last step. Use these two conditions to create a simple session event, P:Abandoned:S. Since this means our customer left empty-handed, companies want to spend a fair amount of resources investigating why it happens. What companies need is a way to quantify how much revenue is lost when an abandonment occurs. For this example, we will focus on the amount (and currency) that was in the last P:CartViewV:E of the session.
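At session end the logic is just two booleans; a sketch (the session-flag names here are hypothetical stand-ins for the session events):

```javascript
// Abandoned = first step of the process seen, last step not seen.
// Field names are hypothetical stand-ins for the :S session events.
function isAbandoned(session) {
  return session.cartViewR && !session.confirmationV;
}
```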

Extracting “loss” on an abandonment

Scraping the amount and the currency off a response can be tough. The details are here <insert xref>. To summarize: you need two Advanced mode events (BB) with three regular expressions (REs). Both events use the same RE to recognize the entire class element that encloses the currency and amount substrings. Each BB event uses one of the other two REs to split the substrings out of the class element string. If you are lucky, the developers have repeated that pattern everywhere. If not, you may be looking at multiple regular expressions in these events.
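Here is a sketch of the three-RE approach against a made-up page fragment (the markup, class name, and patterns are all hypothetical):

```javascript
// Sketch (hypothetical markup): one RE grabs the enclosing element,
// two more split the currency and amount out of the captured text.
var body = '<span class="cartTotal">USD 123.45</span>'; // assumed fragment

var REElement = /class="cartTotal">(.*?)<\//;  // enclosing class element
var RECurrency = /^([A-Z]{3})\s/;              // leading 3-letter currency code
var REAmount = /([\d.,]+)$/;                   // trailing numeric amount

var el = REElement.exec(body);
var currency = el ? RECurrency.exec(el[1])[1] : null; // "USD"
var amount = el ? REAmount.exec(el[1])[1] : null;     // "123.45"
```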

These two BB events, StoreLastSeenRevenueCurrency:BB and StoreLastSeenRevenueAmount:BB, populate two dimensions, RevenueCurrencyList:E-L and RevenueAmountList:E-L. The dimensions are populated with the “Last” value of the BB event, which itself fires on each occurrence of the event P:CartViewV:E.

The two dimensions could be combined into a single report group and attached to any checkout process event, but wait… let’s not waste a multi-dimension report group. The final report group we attach to the abandonment event will have more dimensions…

The amount and currency that is being abandoned…

image

Extracting Error Messages

See <insert xref here> for details. To populate the two dimensions X and Y, it takes two BB events with multiple regular expressions. These BB events look at a page for any class, span, or div pattern that contains the error class element, extract either the language-specific error message or the language-independent error code, repeat down the whole page view, and aggregate these one or more substrings into one large aggregate string.

You can attach these dimensions to a step event, but they will always have the last seen event value – even if the last seen error was two hits back. Instead, we create an event G:AnyVisibleFieldError:E, and attach the two dimensions to this event along with the URL (Normalized) dimension. This provides an event whose dimensions can tell us “what is the most common error” for any specific URL or group of URLs. That’s pretty potent! If you attached just these to the P:Abandoned:S event, you would know which error messages were most often seen last when visitors abandoned. Again, let’s not waste a multi-dimension report group…

Top Error messages…

image

To find out which error message was last seen most often, sorted by revenue lost, create a report group with all four dimensions we’ve discussed:

a/b/c/d

Now chart the P:Abandoned:S event in the report builder to see the culmination of our efforts: which error messages are seen for the largest amount of abandoned shopping carts, grouped by currency, amount list, and error message (and error code)…

image

Throw in the revenue amount list to group these

image

 

The P:Abandoned Event

  • Evaluate at End of Session
  • Fires if the session has a ProcessStart and NOT a ProcessEnd event
  • Numeric type of value
  • Store the value of the amount shown at ProcessStart
  • Report group contains RevenueCurrency:E-L, RevenueAmount:E-L, and AggregateVisibleFieldErrorMessage:E-L

I hope this has helped you understand process events, and given you some ideas for expanding your use of them.