I used to belong to the school of thought that said save everything in Tealeaf. My reasoning was that if a bot starts to have a negative effect on the site, I want to be able to find it in Tealeaf. That remains a very important consideration: over the years, I've traced a lot of site performance degradations, and even complete site outages, to runaway bots. The drawback to saving bot traffic in Tealeaf is the complexity it adds to creating events. Creating a variation on every event to cover totals, bot-only, and non-bot-only is time-consuming. Creating the correct reports from these variations is both time-consuming and often very complex. And trying to communicate to the report consumers (business team executive stakeholders <grin>) the “ifs, ands, and buts” that bots introduce into the data – well, let's just say I really would rather not have to.
So I wanted a solution that would remove bot traffic from the canisters, so it couldn't impact business events, yet still keep the traffic available for analysis. The solution described here takes bot hits out of the data stream but saves their pertinent data to TLA files for later analysis if desired. A couple of advantages to this approach:
- You are going to be shocked by how much RAM, disk, and CPU processing you save once the robot data is out of the data stream. That is offset somewhat by the storage space the TLA files take, but the solution drops response bodies, so the net result is still a reduction in disk consumption.
- You no longer have to take special care to eliminate robots from your event definitions, or create special events to recognize robots. Your data stream is now “real users only”, to the extent that you can make it so.
The following picture shows the pipeline components needed for the bot elimination. A nice PDF of the drawing is here, and a zip file of the pipeline components discussed in this post can be found here.
Before we start, make sure you have the ‘use HTTP_X_FORWARDED_FOR’ radio button checked on the PCA server to ensure that the REMOTE_ADDR is pointing to the true remote IP address, not to the load balancer in your data center. This is covered in the PCA administrator's guide.
Start with the TLTREF session agent (SA). You should already have this in your pipeline, using the browscap.csv file (and maintaining it!). There's no reason for me to recap how to set up TLTREF and browscap; it is well covered in the Tealeaf documentation. This SA creates the [ExtendedUserAgent] section in the request block and, for bots recognized by user agent, sets TLT_TRAFFIC_TYPE=BOT. An example of the resulting request section follows the configuration below.
TLTREF session agent:
[TLTRef]
TypeName=TLTRef
DLL=SessionAgentTLTRef.dll
DownStreamConfigSection=PrivacyStealthRobotByIP
ReferenceMaster=true
UpdateInterval=600
ReferenceFileMaxSize=1000
URLReferenceFile=.\system\Reference_Path.txt
ServerReferenceFile=.\system\Reference_Server.txt
HostReferenceFile=.\system\Reference_Host.txt
AppReferenceFile=.\system\Reference_Application.txt
URLReferenceRules=true
URLReferenceRulesMode=cont
URLReferenceRulesFile=.\system\Reference_Path_Rules.txt
UseFullVirtualDir=False
NormalizeHost=True
NormalizedHostValues=WWW.<YourSiteHere>.COM;
NormalizeServer=True
NormalizeAppName=True
NormalizeURL=True
NormalizeURLExt=ACTION;ASMX;ASP;ASPX;CSS;DO;HTM;HTML;GIF;ICO;JPG;JS;JSP;JHTML;PDF;PHP;SWF;TXT
NormalizeURLStatusCode=0;200;204;206;301;302;303;304;400;402;403;404;410;500;501;502;503;504;505
AllowEmptyExtension=True
NormalizeURLRootDefault=DEFAULTPAGE
OutputReferenceStats=True
StatsKeepDays=30
OutputReferenceStatsMin=1
StatsReportingInterval=60
OutputUserAgentStats=True
ReferenceIDOffset=0
ReferrerParsing=True
ReferrerPrepend=REF_
AdvancedUAParsing=True
UAFilesDir=.\system\
PruningInterval=60
MaxCacheSize=2000
EnableUserAgentLogFile=True
UALog=False
UserAgentLogLevel=1
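For reference, here is roughly what the [ExtendedUserAgent] section of a request might look like after TLTREF identifies a hit from a well-known crawler. Only TLT_BROWSER and TLT_TRAFFIC_TYPE are used later in this post; treat the exact values as illustrative, since they depend on your browscap.csv.

[ExtendedUserAgent]
TLT_BROWSER=Googlebot
TLT_TRAFFIC_TYPE=BOT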
Below (downstream of) TLTREF we are going to add the PrivacyStealthRobotByIP session agent, a Privacy SA that uses the configuration file PrivacyStealthRobotByIP.cfg. The following is an example of the single rule in this privacy cfg file. As you can see, the rule tests the REMOTE_ADDR value against lists of known stealth-bot IP addresses (you could accomplish the same thing with a regular expression of IP addresses separated by the ‘|’ character, anchored to the start and end of the REMOTE_ADDR value buffer). Any IP address that matches sets the request’s TLT_TRAFFIC_TYPE to BOT, and the TLT_BROWSER to ‘Stealthbot-<IPaddress>’, where <IPaddress> is the value of REMOTE_ADDR.
PrivacyStealthRobotByIP session agent:
[PrivacyStealthRobotByIP]
TypeName=PrivacyStealthRobotByIP
DownStreamConfigSection=RTARouteBots
ConfigFile=PrivacyStealthRobotByIP.cfg
DLL=SessionAgentPrivacy.dll
LogLevel=Error
LogRules=True
PrivacyStealthRobotByIP.cfg:
[Rule1]
StopProcessing=True
Enabled=True
Tests=T_1,T_2
Actions=A_Traffic_Type, A_Browser
[T_1]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=192.168.255.255;10.10.10.255
[T_2]
ReqField=REMOTE_ADDR
ReqOp=PartOfList
ReqVal=216.106.218.118;209.85.238.2;209.85.238.9
[A_Traffic_Type]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_TRAFFIC_TYPE
ReplaceString=BOT
[A_Browser]
Action=ReqSet
ReqSetSection=[ExtendedUserAgent]
ReqSetField=TLT_BROWSER
StartPatternRE=REMOTE_ADDR=(.*)
Inclusive=True
ReplaceString=Stealthbot-{g1}
Now that robots have been identified in the data stream, the next downstream session agent is the RTARouteBots SA. This SA uses TLT_TRAFFIC_TYPE to pick off the bot hits and send them to a child router pipeline. It reads the file ‘Rule Scripts\RTARouteBots.ini’, which has a single rule that sets the pipe ID to ‘1’ if TLT_TRAFFIC_TYPE contains ‘BOT’. In the RTARouteBots session agent, PipelineConfig1 names the file HRCP_RouteBots.cfg. All bot hits are sent to the child pipeline defined in that cfg file, and all non-bot hits continue on down the main pipeline.
RTARouteBots session agent:
[RTARouteBots]
TypeName=RTARouteBots
DLL=SessionAgentPipelineSplitter.dll
DownStreamConfigSection=<YourNextDownstreamSA>
ScriptTrace=OFF
RTAIni=RTARouteBots.ini
ResponseType=All
EnvironmentScript=EngineEnvironment.tcl
PreProcScript=RTA.PreProc.tcl
ActionScript=RTA.Action.tcl
PipelineConfig1=HRCP_RouteBots.cfg
RTARouteBots.ini:
TeaLeaf RTA
<RULE NUM="1" STATUS="ENABLED" DESCRIPTION="Route Bot Traffic Down PipelineID1" STOPRULES="YES">
  <GROUP TYPE="REQTest" REQF="TLT_TRAFFIC_TYPE" REQOP="CONTAINS" SRCHCASE="NO" REQVAL="BOT">
  </GROUP>
  <GROUP TYPE="SetPipeID" PipeID="1" Type="SET">
  </GROUP>
</RULE>
This ends our changes to the main pipeline. Simply add the two new session agents to your existing pipeline just downstream of the TLTREF session agent, and in the RTARouteBots SA, set DownStreamConfigSection to whatever comes next in your pipeline.
The Hit Router Child Pipeline file HRCP_RouteBots.cfg is derived from a standard child pipeline. It has only three session agents: Decouple, a Privacy SA used to drop hit response bodies, and Archive. Just after Decouple, the Privacy SA PrivacyDropBotHitResponseBodies uses a configuration file of the same name, PrivacyDropBotHitResponseBodies.cfg. That file has a single rule that fires on every hit, with an action that removes the response body from the hit. There is almost no reason to keep the pages the robots saw in your archive files, and dropping them usually shrinks the files about ten-fold. The Archive SA defines the archive file location and instructs the archiver to roll the archive file at midnight every day, and also to roll it during the day if it exceeds 500MB.
HRCP_RouteBots.cfg:
#This child pipeline is intended to get BOT only traffic from the main pipeline
[StatusMaster]
DLL=StatusMaster.dll
AppName=BotTraffic
[Globals]
DownStreamConfigSection=DecoupleToBotArchive
#DownStreamConfigSection=DecoupleToBotSocket
#DownStreamConfigSection=Null
[DecoupleToBotArchive]
TypeName=DecoupleToBotArchive
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=PrivacyDropBotHitResponseBodies
[PrivacyDropBotHitResponseBodies]
TypeName=PrivacyDropBotHitResponseBodies
DLL=SessionAgentPrivacy.dll
DownStreamConfigSection=BotArchive
ConfigFile=PrivacyDropBotHitResponseBodies.cfg
LogRules=True
[BotArchive]
TypeName=BotArchive
DLL=SessionAgentArchive.dll
LogDirectory=D:\BotTrafficArchives
FileID=BotTraffic
MaxLogSize=500MB
RollTime=00:01
QuotaPctFree=2
QuotaScanTime=20
QuotaDLL=DiskQuota.dll
############# Below this is for sending to another canister on another server (VM)
[DecoupleToBotSocket]
TypeName=DecoupleToBotSocket
DLL=SessionAgentDecouple.dll
MaxQueueSize=2500
DownStreamConfigSection=BotSocket
[BotSocket]
TypeName=BotSocket
DLL=SessionAgentSocket.dll
Server=<Server_Name_Here>
Port=1966
[Null]
TypeName=Null
DLL=SessionAgentNull.dll
PrivacyDropBotHitResponseBodies.cfg:
[Rule1]
Enabled=True
Actions=A_DropResponse
[A_DropResponse]
Action=DropResponse
That's it! There is a link here for all of these files. Edit your copies of these cfg files with your IP addresses, the location where you want the TLA files, and the RTARouteBots SA’s DownStreamConfigSection; drop them into your pipeline, restart the Transport service, and watch the bots disappear from your canisters and appear in the filesystem as TLA files!
Maintenance (ugh!)
You will need to set up a scheduled task that runs every night and removes bot TLA files older than, say, 14 days (whatever works for you). Type ‘task’ into the Windows search box and Task Scheduler will appear near the top of the results; open it and set up your task there.
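If you'd rather script the cleanup step itself, a minimal PowerShell sketch might look like the following; the archive path and 14-day retention are assumptions, so match them to your Archive SA settings and your own policy.

# Remove bot TLA archives older than 14 days.
# The directory and retention period below are assumptions - adjust to your environment.
$archiveDir = 'D:\BotTrafficArchives'
$cutoff     = (Get-Date).AddDays(-14)

Get-ChildItem -Path $archiveDir -Filter '*.tla' -File |
    Where-Object { $_.LastWriteTime -lt $cutoff } |
    Remove-Item -Force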
Maintain your browscap.csv file and your list of Stealth IP addresses regularly.
Keep an eye on the bot traffic to make sure no real users (for example, hits to your order page URLs) are showing up in these files.
Analysis
You can open a .TLA archive file with RTV and analyze the bot traffic that way, but RTV is not very good at summarizing the data or finding the heaviest hitters. I've found that Excel pivot tables are a very powerful analysis tool. If you are not a pivot table expert, Google it! There are tons of resources for learning, and you will really thank yourself for taking the time to learn how to use them. I'm going to show you how to convert a TLA into a CSV file ready to be consumed by a pivot table.
For analysis, it is important to know that a TLA file is just a ZIP file, so you can open it with 7-Zip. Each TLTSID produces a subfolder, and within each TLTSID's subfolder are the .req files for each request. You can open the .req files with any editor and see a very familiar REQ block.
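If you prefer the command line, you can also peek inside a TLA from PowerShell by treating it as a ZIP; this is just a quick illustration, and the file name is hypothetical.

# List the first few TLTSID folders / .req entries inside a bot TLA (a TLA is just a ZIP).
Add-Type -AssemblyName System.IO.Compression.FileSystem
$tla = 'D:\BotTrafficArchives\BotTraffic_20160101.tla'   # hypothetical file name

$zip = [System.IO.Compression.ZipFile]::OpenRead($tla)
try {
    $zip.Entries | Select-Object -ExpandProperty FullName -First 20
}
finally {
    $zip.Dispose()
}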
Writing a PowerShell script to read the TLA files and produce a .CSV summary was a bit persnickety. I've included one in the zip package that does a simple analysis. It works on just one TLA file at a time, and it doesn't take arguments; you'll need to edit the first four lines to set up things like the input and output file names. If I get sufficient interest and comments, I'll consider making the script user-friendly and production-ready.
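That script isn't reproduced here, but the general shape of the job is simple: extract the TLA, parse the name=value lines of each .req file, and emit one CSV row per hit. The sketch below shows the idea under those assumptions; the paths and the HTTP_URL field name are placeholders, not my script's defaults.

# Sketch: summarize one bot TLA into a CSV suitable for an Excel pivot table.
$tlaFile = 'D:\BotTrafficArchives\BotTraffic_20160101.tla'   # hypothetical input
$csvFile = 'D:\BotTrafficArchives\BotTraffic_20160101.csv'   # hypothetical output
$workDir = Join-Path $env:TEMP 'BotTlaWork'

# A TLA is just a ZIP: copy it to a .zip name so Expand-Archive accepts it, then extract.
$zipCopy = Join-Path $env:TEMP 'BotTla.zip'
Copy-Item $tlaFile $zipCopy -Force
if (Test-Path $workDir) { Remove-Item $workDir -Recurse -Force }
Expand-Archive -Path $zipCopy -DestinationPath $workDir

# Walk the per-TLTSID folders and pull a few fields out of each .req file's name=value lines.
$rows = foreach ($req in Get-ChildItem $workDir -Recurse -Filter '*.req') {
    $fields = @{}
    foreach ($line in Get-Content $req.FullName) {
        if ($line -match '^([^=]+)=(.*)$') { $fields[$matches[1]] = $matches[2] }
    }
    [pscustomobject]@{
        TltSid      = $req.Directory.Name
        RemoteAddr  = $fields['REMOTE_ADDR']
        Browser     = $fields['TLT_BROWSER']
        TrafficType = $fields['TLT_TRAFFIC_TYPE']
        Url         = $fields['HTTP_URL']    # assumed field name; use whatever your .req blocks carry
    }
}

$rows | Export-Csv -Path $csvFile -NoTypeInformation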
The output of the PowerShell script is a .CSV file. Open it with Excel, press Ctrl-A to select everything, then choose Insert->PivotTable->PivotChart and take the defaults by clicking OK.
I've done a fifteen-minute video capture showing the kinds of analysis that you can do on this data. The .mp4 file of this video can be found here.
Next Steps:
If anyone extends the PowerShell script with additional features, or comes up with additional analysis tricks for the CSV file, be sure to post and share them in the Tealeaf Users Group on LinkedIn. Ditto for any ideas you have on finding the IP addresses of stealth bots.
I hope this technique helps you write Tealeaf events that are better focused on the business needs, and also gives you a good way to analyze your robot traffic. Happy TeaLeaf’ing!