Tuesday, July 22, 2014

A Practical Answer to the Question “What to do about Robots?”

Every Tealeaf customer quickly discovers that their data stream contains a large number of robot sessions – web crawlers, scrapers, internal and external keep-alive monitors. Some of these visitors are polite, identify themselves, and honor your robots.txt file. Others are not so polite, try to stay anonymous, and go anywhere on your site they like. Soon after go-live, the Tealeaf event administrators raise the question: “What do we do about robots in our events?”
I used to belong to the school of thought that said – save everything in Tealeaf. My reasoning was that if a bot starts to have a negative effect on the site, I want to be able to find it in Tealeaf. That remains a very important consideration: over the years, I've traced a lot of site performance degradations, and even complete site outages, to runaway bots. The drawback to saving bot traffic in Tealeaf is the complexity it causes in creating events. Creating a variation on every event to cover totals, bot-only, and non-bot-only is time-consuming. Creating the correct reports from these variations is both time-consuming and often very complex. And trying to communicate to the report consumers (business team executive stakeholders <grin>) the “ifs, ands, and buts” of the data caused by the impact of bots – well, let's just say I really would rather not have to.
So I wanted to come up with a solution that would remove bot traffic from the canisters so it wouldn't impact business events, yet still keep the traffic available for analysis. The solution described below takes the bots out of the data stream, but saves their pertinent data to TLA files for later analysis if desired. A couple of advantages to this approach are:
  • You are going to be shocked by how much you save in RAM, disk, and CPU processing once the robot data is out of the data stream. Of course, that is offset somewhat by the storage space the TLA files take – but the solution drops response bodies, for an overall reduction in disk consumption.
  • You no longer have to take special care to eliminate robots from your event definitions, nor create special events to recognize robots. Your data stream is now “real users only”, to the extent that you can make it so.
First, one caveat: this post is NOT about how to recognize stealth robots. Later, we will show you how to treat certain IPs as ‘all traffic from this IP is bot traffic’, but I'm going to assume you are doing some kind of external analysis of your data to identify stealth robots by their IP addresses.

The following picture shows the pipeline components necessary for the bot elimination. A nice PDF of the drawing is here, and a zip file of the pipeline components discussed in this post can be found here.


Before we start, make sure you have the ‘use HTTP_X_FORWARDED_FOR’ radio button checked on the PCA server to ensure that the REMOTE_ADDR is pointing to the true remote IP address, not to the load balancer in your data center. This is covered in the PCA administrator's guide.

Start with the TLTREF session agent (SA). You should already have this in your pipeline, using the browscap.csv file (and maintaining it!). There’s no reason for me to recap how to set up TLTREF and browscap; it is well covered in the Tealeaf documentation. This SA creates the [ExtendedUserAgent] section in the request block and, for bots recognized by user-agent, sets TLT_TRAFFIC_TYPE=BOT.

TLTREF session agent:
[TLTRef]
TypeName=TLTRef
DLL=SessionAgentTLTRef.dll
DownStreamConfigSection=PrivacyStealthRobotByIP
ReferenceMaster=true
UpdateInterval=600
ReferenceFileMaxSize=1000
URLReferenceFile=.\system\Reference_Path.txt
ServerReferenceFile=.\system\Reference_Server.txt
HostReferenceFile=.\system\Reference_Host.txt
AppReferenceFile=.\system\Reference_Application.txt
URLReferenceRules=true
URLReferenceRulesMode=cont
URLReferenceRulesFile=.\system\Reference_Path_Rules.txt
UseFullVirtualDir=False
NormalizeHost=True
NormalizedHostValues=WWW.<YourSiteHere>.COM;
NormalizeServer=True
NormalizeAppName=True
NormalizeURL=True
NormalizeURLExt=ACTION;ASMX;ASP;ASPX;CSS;DO;HTM;HTML;GIF;ICO;JPG;JS;JSP;JHTML;PDF;PHP;SWF;TXT
NormalizeURLStatusCode=0;200;204;206;301;302;303;304;400;402;403;404;410;500;501;502;503;504;505
AllowEmptyExtension=True
NormalizeURLRootDefault=DEFAULTPAGE
OutputReferenceStats=True
StatsKeepDays=30
OutputReferenceStatsMin=1
StatsReportingInterval=60
OutputUserAgentStats=True
ReferenceIDOffset=0
ReferrerParsing=True
ReferrerPrepend=REF_
AdvancedUAParsing=True
UAFilesDir=.\system\
PruningInterval=60
MaxCacheSize=2000
EnableUserAgentLogFile=True
UALog=False
UserAgentLogLevel=1 


Just below (downstream of) TLTREF we are going to add the PrivacyStealthRobotByIP session agent, a Privacy SA that uses the configuration file PrivacyStealthRobotByIP.cfg. The following is an example of the single rule in this privacy cfg file. As you can see, each test compares the REMOTE_ADDR value against a semicolon-separated list of known bot IP addresses using the PartOfList operator. Any hit whose REMOTE_ADDR matches one of the listed addresses gets its TLT_TRAFFIC_TYPE set to BOT, and its TLT_BROWSER set to ‘Stealthbot-<IPaddress>’. The <IPaddress> used in the TLT_BROWSER is the value of the REMOTE_ADDR.

PrivacyStealthRobotByIP session agent:
[PrivacyStealthRobotByIP]
TypeName=PrivacyStealthRobotByIP
DownStreamConfigSection=RTARouteBots
ConfigFile=PrivacyStealthRobotByIP.cfg
DLL=SessionAgentPrivacy.dll
LogLevel=Error
LogRules=True

PrivacyStealthRobotByIP.cfg
[Rule1]
StopProcessing=True
Enabled=True
Tests=T_1,T_2
Actions=A_Traffic_Type, A_Browser

[T_1]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=192.168.255.255;10.10.10.255

[T_2]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=216.106.218.118;209.85.238.2;209.85.238.9

[A_Traffic_Type]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_TRAFFIC_TYPE
ReplaceString=BOT

[A_Browser]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_BROWSER
StartPatternRE=REMOTE_ADDR=(.*)
Inclusive=True
ReplaceString=Stealthbot-{g1}


Now that robots have been identified in the data stream, the next downstream session agent is the RTARouteBots SA. This SA is going to use TLT_TRAFFIC_TYPE to pick off the bot hits and send them off to a child router pipeline. It uses the file ‘Rule Scripts\RTARouteBots.ini’, which has a single rule that sets the pipe ID to ‘1’ if TLT_TRAFFIC_TYPE contains ‘BOT’. In the RTARouteBots session agent, PipelineConfig1 names the file HRCP_RouteBots.cfg. All bot hits are going to be sent to the child pipeline defined in this cfg file, and all non-bot hits are going to continue on down the main pipeline.

RTARouteBots session agent:
[RTARouteBots]
TypeName=RTARouteBots
DLL=SessionAgentPipelineSplitter.dll
DownStreamConfigSection=<YourNextDownstreamSA>
ScriptTrace=OFF
RTAIni=RTARouteBots.ini
ResponseType=All
EnvironmentScript=EngineEnvironment.tcl
PreProcScript=RTA.PreProc.tcl
ActionScript=RTA.Action.tcl
PipelineConfig1=HRCP_RouteBots.cfg

RTARouteBots.ini:
TeaLeaf RTA

<RULE NUM="1" STATUS="ENABLED" DESCRIPTION="Route Bot Traffic Down PipelineID1" STOPRULES="YES">
  <GROUP TYPE="REQTest" REQF="TLT_TRAFFIC_TYPE" REQOP="CONTAINS" SRCHCASE="NO" REQVAL="BOT">
  </GROUP>
  <GROUP TYPE="SetPipeID" PipeID="1" Type="SET">
  </GROUP>
</RULE>

This ends our changes to the main pipeline. Simply add the two new session agents to your existing pipeline just downstream of the TLTREF session agent, and in the RTARouteBots SA, set the DownStreamConfigSection appropriately.

The Hit Router Child Pipeline file HRCP_RouteBots.cfg is derived from a standard child pipeline. It has only a few session agents: Decouple, a Privacy SA used to drop hit response bodies, and Archive. Just after Decouple, the Privacy SA PrivacyDropBotHitResponseBodies uses the configuration file of the same name, PrivacyDropBotHitResponseBodies.cfg. This file has a single rule that fires on every hit, with an action that removes the response body from the hit. There is almost no reason to keep the pages the robots saw in your archive files, and dropping them usually shrinks the files by about 10-fold. In the Archive SA, the archive file location is defined, along with instructions to roll the archive file shortly after midnight (00:01) every day, and also roll it during the day if it exceeds 500MB.

HRCP_RouteBots.cfg:

#This child pipeline is intended to get BOT only traffic from the main pipeline
[StatusMaster]
DLL=StatusMaster.dll
AppName=BotTraffic
[Globals]
DownStreamConfigSection=DecoupleToBotArchive
#DownStreamConfigSection=DecoupleToBotSocket
#DownStreamConfigSection=Null
[DecoupleToBotArchive]
TypeName=DecoupleToBotArchive
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=PrivacyDropBotHitResponseBodies 
[PrivacyDropBotHitResponseBodies]
TypeName=PrivacyDropBotHitResponseBodies
DLL=SessionAgentPrivacy.dll
DownStreamConfigSection=BotArchive
ConfigFile=PrivacyDropBotHitResponseBodies.cfg
LogRules=True
[BotArchive]
TypeName=BotArchive
DLL=SessionAgentArchive.dll
LogDirectory=D:\BotTrafficArchives
FileID=BotTraffic
MaxLogSize=500MB
RollTime=00:01
QuotaPctFree=2
QuotaScanTime=20
QuotaDLL=DiskQuota.dll
############# Below this is for sending to another canister on another server (VM)
[DecoupleToBotSocket]
TypeName=DecoupleToBotSocket
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=BotSocket
[BotSocket]
TypeName=BotSocket
DLL=SessionAgentSocket.dll
Server=<Server_Name_Here>
Port=1966
[Null]
TypeName=Null
DLL=SessionAgentNull.dll

PrivacyDropBotHitResponseBodies.cfg:

[Rule1]
Enabled=True
Actions=A_DropResponse
[A_DropResponse]
Action=DropResponse

That's it!  There is a link here for all of these files. Edit your copies of these cfg files for your IP addresses, the location where you want the TLA files, and the RTARouteBots SA’s DownStreamConfigSection; drop them into your pipeline, restart the Transport service, and watch the bots disappear from your canisters and appear in the filesystem as TLA files!
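If you'd rather bounce the Transport service from a console than from the Tealeaf Management System, here is a minimal PowerShell sketch. The wildcard display-name filter is an assumption – the exact service name varies by Tealeaf version, so confirm it in services.msc first.

Restarting the Transport service from PowerShell (run as Administrator):
# Find the Tealeaf Transport service by display name and restart it
Get-Service | Where-Object { $_.DisplayName -like '*TeaLeaf*Transport*' } | Restart-Service -Verbose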

Maintenance (ugh!)
You will need to set up a scheduled task that runs every night and removes bot TLA files older than, say, 14 days (whatever works for you). Type the word ‘task’ into the Windows search box and Task Scheduler will appear near the top of the results – open it and set up your task there (a minimal script for the task to run is sketched below).
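Here is a minimal sketch of what that nightly cleanup could look like, assuming the D:\BotTrafficArchives location from the BotArchive SA above and a 14-day retention; the script name and paths are just examples.

CleanupBotArchives.ps1 (example):
# Delete bot TLA archives older than the retention window
$archiveDir    = 'D:\BotTrafficArchives'   # LogDirectory from the BotArchive session agent
$retentionDays = 14

Get-ChildItem -Path $archiveDir -Filter *.tla |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-$retentionDays) } |
    Remove-Item -Force

You can register it from an elevated command prompt with something like:
schtasks /Create /TN "CleanupBotArchives" /SC DAILY /ST 02:00 /TR "powershell.exe -ExecutionPolicy Bypass -File D:\Scripts\CleanupBotArchives.ps1"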

Maintain your browscap.csv file and your list of Stealth IP addresses regularly.
Keep an eye on the bot traffic to ensure no real users are ending up in these files – order-page URLs showing up in the bot archives would be a red flag (a quick scripted check is sketched in the Analysis section below).


Analysis
You can open a .TLA archive file with RTV and do analysis on the bot traffic that way, but it’s not very good at summarizing the data or finding the heaviest hitters. I’ve found that a very powerful tool for analysis is Excel pivot tables. If you are not a pivot table expert – Google it! There are tons of resources for learning them. Really – you will sooo thank yourself for taking the time to learn how to use these! I’m going to show you how to convert a TLA into a CSV file ready to be consumed by a pivot table.

For analysis, it is important to know that a TLA file is just a ZIP file. You can open a TLA file with 7-Zip. Each TLTSID produces a subfolder, and within each TLTSID’s subfolder are the .req files for each request. You can open the .req files with any editor, and see a very familiar REQ block.
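If you'd rather unpack a TLA from a script than with 7-Zip, here is a minimal PowerShell sketch. Expand-Archive (PowerShell 5 and later) only accepts a .zip extension, so the script stages a copy under that name first; the paths and file names are just examples.

Unpacking a TLA with PowerShell (example):
# Unpack one bot TLA archive into a working folder
$tla  = 'D:\BotTrafficArchives\BotTraffic.tla'   # example archive name
$work = 'D:\BotTrafficWork'

New-Item -ItemType Directory -Path $work -Force | Out-Null

# Expand-Archive insists on a .zip extension, so stage a copy under that name
$zip = Join-Path $work 'BotTraffic.zip'
Copy-Item $tla $zip -Force
Expand-Archive -Path $zip -DestinationPath (Join-Path $work 'extracted') -Force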

Writing a PowerShell script to read the TLA files and produce a .CSV summary was a bit persnickety. I’ve included one in the zip package that does a simple analysis. It works on just one TLA file at a time, and it doesn’t take arguments – you’ll need to edit the first four lines to set up things like input and output file names. If I get sufficient interest and comments, I’ll consider making the script user-friendly and production-ready.
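The script in the zip package is the one to use; purely to illustrate the idea, here is a stripped-down sketch of the approach. It assumes the TLA has already been unpacked (as above) and that the .req files contain flat name=value lines for the listed fields – adjust the field list and paths to match what your request blocks actually contain.

TLA-to-CSV summary (sketch):
# Summarize extracted .req files into a CSV ready for a pivot table
$extractDir = 'D:\BotTrafficWork\extracted'     # output of the unpack step above
$outCsv     = 'D:\BotTrafficWork\bot_summary.csv'

# Fields to become pivot table columns (names assumed from the request block)
$fields = 'TLTSID','REMOTE_ADDR','TLT_BROWSER','TLT_BROWSER_VERSION','TLT_URL','URL','StatusCode'

$rows = Get-ChildItem -Path $extractDir -Recurse -Filter *.req | ForEach-Object {
    # Start each row with empty values so every CSV column is present
    $row = @{}
    foreach ($f in $fields) { $row[$f] = '' }

    foreach ($line in Get-Content $_.FullName) {
        $pair = $line -split '=', 2
        if ($pair.Count -eq 2 -and $fields -contains $pair[0].Trim()) {
            $row[$pair[0].Trim()] = $pair[1].Trim()
        }
    }
    New-Object PSObject -Property $row
}

$rows | Export-Csv -Path $outCsv -NoTypeInformation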

The output of the PowerShell script is a .CSV file. Open it with Excel, hit Ctrl-A to select everything, and choose Insert->PivotTable->PivotChart (then take the defaults by clicking ‘OK’).

I've done a fifteen-minute video capture showing the kinds of analysis that you can do on this data. The .mp4 file of this video can be found here.
Here is the result showing how many hits came from each bot over a two-hour period. The default set of properties included in the simple analysis will let you pivot on TLT_BROWSER, TLT_BROWSER_VERSION, URL, TLT_URL, STATUSCODE, REMOTE_ADDR, TLTSID, and any combination of these. You can find out which bot hits you most often, which ones are producing status code 500 errors (and how many), and whether any are hitting pages that aren’t covered by your TLTREF SA’s NormalizeURLExt list (look for places where URL=Others). Remember to save the file as an Excel Workbook to preserve your pivot table work.
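One more use for the CSV, tying back to the maintenance note above: a quick scripted check that no real-user pages are landing in the bot archives. The ‘order|checkout’ pattern and the file name are only examples – substitute the URLs that matter on your site.

Checking the bot summary for real-user pages (example):
# Flag bot hits that landed on pages only real users should be requesting
Import-Csv 'D:\BotTrafficWork\bot_summary.csv' |
    Where-Object { $_.URL -match 'order|checkout' } |
    Select-Object REMOTE_ADDR, TLT_BROWSER, URL |
    Sort-Object REMOTE_ADDR -Unique |
    Format-Table -AutoSize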



Next Steps:
If anyone extends the PowerShell script for additional features, or comes up with additional analysis tricks for the CSV file, be sure to post and share them in the Tealeaf Users Group on LinkedIn. Ditto for any ideas you have on finding the IP addresses of stealth bots.

I hope this technique helps you write Tealeaf events that are better focused on the business needs, and also gives you a good view of your robot traffic. Happy TeaLeaf’ing!