Monday, March 31, 2014

Privacy Rules and blocking PCI data

I recently got a comment asking for clarification of the tealeaf privacy rules. These are the critically important part of Tealeaf data processing that eliminates or masks Personal Confidential Information (PCI), things like credit card numbers and passwords. I looked at the IBM FAQs and was surprised to find very little information there. Years ago I wrote a technical explanation of how to block PCI data in a value attribute and posted it on the tealeaf community site, but it appears that post did not make it from viaTeaLeaf into the IBM site, so I’ll rewrite and expand upon it here.

Where PCI Blocking occurs

There are three different places in the tealeaf systems where you need to make configuration changes to effectively block PCI data: the hit’s request block, its response block, and the client-side Client User Interface (CUI) recorded data hit. Blocking PCI data in the request and response is accomplished in the tealeaf pipeline. The CUI data blocking is done in the CUI/SDK configuration file. In the pipeline, the privacy session agents (Privacy and PrivacyEx) can block or mask PCI data. These two terms are significantly different to the information security teams. Blocking PCI data means it is destroyed in the data stream, so there is no way to recover it. Masking means encrypting the PCI data in such a way that only authorized users can see it. Blocking is easier to implement, and I’ll use the term ‘block’ throughout most of this post. Masking the data requires more implementation steps, and I’ll devote a section to that later in the post.

I always urge clients to implement all the PCI data blocking in the pipeline at the PCA tier. Doing it here keeps PCI data off all the downstream servers, so only the PCA servers have to be made PCI compliant and audited by the company’s information security team.

PCI Blocking in the Pipeline

Privacy rules are implemented in the privacy agents, and can be used for much more than just blocking or masking PCI data. Privacy rules aren’t really very hard: they are just search and replace/extract patterns. This blog post is going to focus narrowly on their use with PCI data. They are implemented in the privacy session agent of the tealeaf data pipeline, and can be put into the pipeline at any tier: PCA, HBR, or Processing server. But for PCI blocking, I strongly urge this be done in the PCA.

The privacy session agent reads the privacy.cfg file for its search and replace patterns. Since the PCA servers are Linux boxes and the HBR and Processing servers are Windows boxes, the path names to the privacy files will of course be different. But the contents of the file are identical. In its simplest format, it is a list of [rules]. The PCA has a visual GUI for editing the privacy rules on the PCA. It’s instructive to try different privacy rule formulations in the GUI and see how it affects the privacy.cfg file, but for this post, I’ll be old-school and focus on just the contents of privacy.cfg. You can edit this file with any text editor, and you can put this file under source code control to track version changes.

In a few weeks I hope to post details on how to protect this file from changes, to reduce the possibility that an unscrupulous system admin makes changes to allow harvesting of PCI information.

Blocking the data in the Request block

The first and easiest place you need to block data is in the request block of the hit. Your web application is going to put up a form, and there are going to be <input> tags that define text fields where the users enter passwords, credit card numbers, CVV numbers, answers to security questions, old passwords/new passwords, and other PCI data fields. Every HTML input tag has either a name= or an id= attribute in the tag. When the page posts (or ‘gets’), the input data, along with the name or id, is passed as either a query parameter or as part of the request body. For privacy, you don’t have to care if the page ‘posts’ or ‘gets’; tealeaf automagically blocks both ‘post’-ed and ‘get’-ed data. To block specific data fields in the request block, you specify a list of input field names (or ids). It just takes one rule and one action block. The rule is always enabled and names an action block, and that action block specifies the Action ‘Block’, the Section ‘urlfield’, and a ValueName holding the delimited list of fields to block. Together, they look like this:

[Rule1]
Enabled=True
Actions=A_TextBlockURLFields

[A_TextBlockURLFields]
Action=Block
Section=urlfield
ValueName=CreditCard|CardNumber|NewPassword|SecurityAnswer1

I’ve managed ValueName lists that are pretty long. Often the web application is developed with .Net or other frameworks, and the framework assigns the control names. .Net applications in particular have very long control names, like ctl00$ContentInfo$CreditCard$CreditCardNumber$txtCCNum, and ctl00$ContentInfo$SecurityQuestionAnswer$AnswerReType$txtSecurity. You do have to fully specify (no wild cards) the full name of every field to block. 

If the web application has accrued lots of pages and lots of PCI-impacting fields from lots of developers over the years, managing the list of PCI-impacting field names can be a pain. But a little creative Excel spread sheeting makes it possible to manage even very long lists of field names to block. I’ve a couple of Excel formulas that can help – comment on this post if you are interested.
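For those who would rather script the list than maintain it in a spreadsheet, here is a minimal Python sketch that builds the ValueName line from a list of field names. The helper name build_value_name is made up for this illustration, and the pipe delimiter is simply copied from the sample action block above; check it against your own privacy.cfg before relying on it.

```python
def build_value_name(field_names, delimiter="|"):
    """Return a ValueName= line built from an iterable of field names.

    Blank entries and duplicates are dropped; the original order is kept.
    The delimiter is an assumption copied from the sample action block.
    """
    seen = set()
    unique = []
    for name in field_names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            unique.append(name)
    return "ValueName=" + delimiter.join(unique)


fields = [
    "CreditCard",
    "CardNumber",
    "NewPassword",
    "SecurityAnswer1",
    "CardNumber",  # duplicate from another page, dropped
]
print(build_value_name(fields))
# ValueName=CreditCard|CardNumber|NewPassword|SecurityAnswer1
```

Keeping the field names one per line in a plain text file under source control, and generating the line from that file, makes it easy to see exactly which field was added in which release.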

Blocking the data in the Response block

The next place to block PCI data is in the response block. This takes a bit more work, and you have to know your application. In particular, where does your application display credit card numbers, passwords, answers to security questions, etc.? Usually, there are far fewer places where your application echoes PCI data back to the browser. But you will need to look at example pages where, for example, a credit card number is displayed, or a list of security questions/answers is displayed. What you are going to need to do is write one or more rules that can match a preamble, the sensitive data, and a postamble. The privacy rules support most Regular Expression constructs, so it’s pretty easy to write the rules once you have found some example pages. You don’t have to match every possible combination in one expression; it’s fine to have multiple actions, each matching one or more places where PCI data may appear in the response. Nor do you need to specify the page name (URL) where the PCI data lives. In fact, I never specify a URL when blocking sensitive data, because it’s too easy for a URL to get changed. I always write the rules to look for the data patterns, and let the PCA processors look at every page that comes through, looking for the patterns. As long as the PCA CPUs are not breaking a sweat, there's no problem with letting them inspect every page. Modern servers have plenty of CPU cycles. Just keep an eye on all the tealeaf pipeline processes on a PCA during a typical peak busy hour, and keep their CPU utilization under 60% or so, just to be safe.

Back to the regular expressions for data blocking… Earlier we had a rule and an action for blocking the request data. We can simply add another action to the rule for blocking the response data. The action will execute on every hit. Again, as long as the PCA’s CPUs are not heavily loaded, there’s no problem with that. What this particular rule is going to do is block the value attribute of specific HTML input tags. Have you ever entered a credit card number on a page, submitted it, had a mistake somewhere on that page, and had the web site helpfully echo the credit card number back in its input field? This is usually accomplished with the value attribute, so blocking the value attribute prevents tealeaf from recording the PCI data that the web site echoes back in the input field.

The rule now has two additional comma-separated actions:

[Rule1]
Enabled=True
Actions=A_TextBlockURLFields , A_BlockCCInResponse1, A_BlockCCInResponse2

And the privacy.cfg file has two added action blocks.

[A_BlockCCInResponse1]

Action=Block
Section=response
Field=body
StartPatternRE=(?-s)<[^>]*?\sname\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]
Inclusive=True
BlockingMask=value\s*=\s*["']{0,1}([^"']*)["']{0,1}[\/\s>]

The tealeaf manual section on privacy rules explains the Action, Section, Field, StartPatternRE, Inclusive and BlockingMask parameters, so we will just focus on the StartPatternRE and the BlockingMask values.

Regular Expressions are your friend!

Regular Expressions, or RegExs, tell a computer how to match a pattern. RegEx engines are heavily optimized, so a well-written pattern can be very efficient. There are plenty of tutorials on the web for constructing regular expressions, so I’m going to just explain the RegEx above that will block the value attribute.

The string in the response to be found will look like <input OptionalStuffWeCanIgnore name = "FieldName" OptionalStuffWeCanIgnore value = '123456789123456' OptionalStuffWeCanIgnore >. The spaces around the = characters are optional, and the string may use double quotes or apostrophes (single quotes) around the attribute values. Developers (and development tools) are free to construct their HTML any way they like, as long as it conforms to the W3C standard, so our RegEx needs to be sophisticated enough to match any standard formulation.

First, the alternation construct is (a|b|c), which says to match “a or b or c”. Our RegEx uses an alternation construct to list all of the input field names, like (password|newpassword|oldpassword|creditcardnumber|cvv). Because the alternation is wrapped in parentheses, it also creates a match group that records which alternative occurred. Alternation itself is cheap in terms of processing power; match groups cost a little more. In PCRE-style syntax, an alternation that should not create a match group is written (?:a|b|c). The (?-s) at the start of our RegEx does something different: it turns off “single-line” mode, so the . character will not match a line break, and it stays in effect until a (?s) appears in the pattern.

Next we have character classes. ["'] says to match either the double quote or the apostrophe (single quote) character, and the length modifier {0,1} says to match whatever precedes it 0 or 1 times. Together, ["']{0,1} says to match either 0 or 1 double quote or apostrophe character. Another character class we use is [^>], which says to match any character except the >. Two more length modifiers we use are the *, which says to match the preceding item 0 or more times, and the +, which says to match the preceding item 1 or more times.

Any character is matched by the . character. Whitespace (spaces, tabs and line breaks) is matched by the special sequence \s. Greedy matching is the default for a RegEx, which means that if you have the string “a b c d e f” and you say a.*\s, it will match the longest possible substring, “a b c d e ”. If you want to match the shortest substring, add the ? character after the *, so a.*?\s will match “a ”.

To match the character / itself, it needs to be “escaped”, so \/ matches the / character.

Our final construct is the backreference. Within a RegEx, we can refer to something that matched earlier within the RegEx, if you put what you want to reference in a () grouping. Every such () pair is a numbered match group: whatever matches within the first pair of () is referenced later in the RegEx as \1, the second pair as \2, etc. So when the string has a pair of double quotes around a value, or a pair of apostrophes around a value, the W3C standard says the same character (double quote or apostrophe) has to be the beginning and end, and the backreference lets us enforce that. (["']{0,1})(a|b|c)\1 says to match a or b or c if it has no quote characters surrounding it (that’s the 0), or if it has a pair of double quote characters or a pair of apostrophe characters surrounding it.

Putting some of these together into short substrings, \s+[^>]*? says to match a whitespace character occurring one or more times, then any character that is not the > occurring 0 or more times (non-greedy).
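If you want to see these constructs in action before tackling the full pattern, here is a small sketch using Python’s re module. Python is only a stand-in here and is not the engine the PCA uses; the sample strings are made up, but greedy versus non-greedy matching and backreferences behave the same way in the mainstream engines.

```python
import re

text = "a b c d e f"

# Greedy: .* grabs as much as it can, then backs off to the last whitespace.
greedy = re.match(r"a.*\s", text).group(0)
# Non-greedy: .*? grabs as little as it can, stopping at the first whitespace.
lazy = re.match(r"a.*?\s", text).group(0)
print(repr(greedy))  # 'a b c d e '
print(repr(lazy))    # 'a '

# Backreference: \1 must repeat whatever the first () group matched,
# so the opening and closing quote characters have to be the same.
quoted = re.compile(r"""(["']{0,1})(a|b|c)\1""")
for candidate in ['"a"', "'b'", "c", "\"a'"]:
    matched = quoted.fullmatch(candidate) is not None
    print(candidate, "matches" if matched else "does not match")
```

The last candidate opens with a double quote and closes with an apostrophe, so the backreference refuses it.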

Here is the StartPatternRE, and an explanation of it.

(?-s)<[^>]*?\sname\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]

(?-s)<[^>]*?\sname : Turn off single-line mode, so that . will not match across a line break. Start matching at a < character, skip anything until “whitespace name”, and if a > appears before the string name, stop trying to match.

name\s*=\s* : Look for the string name, then 0 or more whitespace, the = character, and zero or more whitespace.

(["']{0,1}) : Look for either the double quote character or the apostrophe character, occurring exactly 0 or 1 time. Group this to create a backreference. Since this is the first pair of () characters, this backreference can be referred to as \1.

(CreditCard|CardNumber|NewPassword|SecurityAnswer1) : This is a sample of the alternation that lists the exact field names of your application that need to be blocked. Only fields in which the application will echo PCI data using a value attribute need to be listed, separated by the vertical pipe | character. This is the second pair of () characters.

\1\s+[^>]*?value : Match whatever the first backreference matched, then one or more whitespace, then any string of characters that is not the > character (non-greedy, that is, the shortest substring), followed by the string value.

\s*=\s*(["']{0,1}) : 0 or more whitespace, the = character, then zero or more whitespace, then either the double quote character or the apostrophe character, occurring exactly 0 or 1 time. Group this to create a backreference. Since this is the third pair of () characters, this backreference can be referred to as \3.

.*?\3[\/\s>] : Any character occurring 0 or more times (non-greedy), then the third backreference. After the third backreference matches, it must be followed by a whitespace character, or the > character, or the / character. Closing a tag in HTML can be either > or />, so after the value attribute, the input tag either continues with more attributes (a whitespace character will follow the backreference), or the input HTML tag will close, so we need to match either > or /.

Whew! That was a lot of explanation. I hope you followed all of that, but if not, there are tools to help you visualize how all of this works.
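One quick way to convince yourself the pattern does what the walkthrough says is to run it against a few hand-made input tags. The sketch below uses Python’s re module as a stand-in, which is not the PCA’s engine, so treat it as a sanity check only. The leading (?-s) is left out because Python’s re does not accept a bare (?-s), and its . already stops at line breaks by default. The sample tags are invented for the illustration.

```python
import re

# StartPatternRE from the post, without the leading (?-s) (see note above).
START_PATTERN = re.compile(
    r"""<[^>]*?\sname\s*=\s*(["']{0,1})"""
    r"""(CreditCard|CardNumber|NewPassword|SecurityAnswer1)"""
    r"""\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]"""
)

samples = {
    "double quotes": '<input type="text" name="CreditCard" value="4111111111111111" />',
    "apostrophes": "<input name='CardNumber' class=\"cc\" value='4111111111111111'>",
    "no quotes": "<input name=NewPassword value=secret>",
    "unlisted field": '<input name="Email" value="someone@example.com">',
    "mismatched quotes": "<input name=\"CreditCard' value=\"x\">",
}

for label, tag in samples.items():
    hit = START_PATTERN.search(tag)
    print(f"{label:18} {'MATCH' if hit else 'no match'}")
```

The first three tags should match and the last two should not: the field name Email is not in the alternation, and the mismatched quote pair fails the \1 backreference.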

Javascript Regular Expression Engines and Testing tools

There are web sites and online tools to help you construct Regular Expressions and test them. You paste in the string to be tested (a cut’n’paste of a web page snippet that contains the input tag with value attribute) and the RegEx, run the test, and the tool will tell you if the RegEx matched. Good tools even break down each piece of the RegEx and tell you where it matches the string. But be careful: online tools use different RegEx engines! You need to carefully validate that the online tool you use produces the same result as the PCA.

Here are my two favorite online tools for testing a  RegEx.

http://regexpal.com/  This one is good,  but basic.

http://myregextester.com/  My personal favorite. In particular, it has an ‘explain’ function that breaks down your RegEx piece by piece and shows you why/how it matches. But watch out: the tool doesn’t work well in the Chrome browser. I use IE when I’m on this site.

These test tools are especially good for testing RegExs you might use in an Advanced Mode event. The Event Processing engine is written in JavaScript, and uses the Google JavaScript engine. Both of the tools above use a JavaScript engine (I have no idea which engine). Make sure your testing tool, whatever you select, uses a JavaScript engine, because there are a few differences between the .Net engine, the Perl Compatible Regular Expressions (PCRE) engine, and JavaScript engines. The biggest difference to watch out for is use of the ‘.’ character in multi-line matches. If you want to match ‘any character’ and the substring crosses a line boundary (CR,LF character pair), the .* construct won’t work in JavaScript. But [\S\s]* will work. If you don’t follow that after studying the RegEx rules, post a comment, and I’ll go into detail if anybody asks…
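The line-boundary behavior is easy to demonstrate. The sketch below uses Python rather than JavaScript, so it is only an analogy, but Python’s . has the same default of refusing to match a line feed, while the [\S\s] class (any non-whitespace or whitespace character) matches everything. The page snippet is made up.

```python
import re

# A made-up response fragment where the data sits on its own line (CR,LF).
page = "<td>Card:\r\n4111111111111111</td>"

dot_match = re.search(r"<td>.*</td>", page)       # . will not cross the line feed
any_match = re.search(r"<td>[\S\s]*</td>", page)  # [\S\s] matches any character

print(dot_match)               # None
print(any_match is not None)   # True
```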

The Blocking Mask

The section above described the StartPatternRE portion of the rule. This identifies and isolates the input field that contains a specific name attribute and contains a value attribute. But we have not yet told the Action what to block. We use the Blocking Mask for that. Whatever appears in the first grouping () set of parentheses will be replaced with the StrikeCharacter. The default StrikeCharacter is ‘X’.

BlockingMask=value\s*=\s*["']{0,1}([^"']*)["']{0,1}[\/\s>]

With the blocking mask as specified above, the response block will be modified to become value = “XXXXXXXXXX”
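To make the interplay between the two patterns concrete, here is a rough Python simulation. It is my own approximation of the behavior described above, not the PCA’s actual code: it assumes the BlockingMask is applied only inside the text matched by StartPatternRE, and that each character of the first group is replaced by one strike character. The function name block_values is invented, and the leading (?-s) is omitted because Python’s re does not accept it in that form.

```python
import re

START_PATTERN = re.compile(
    r"""<[^>]*?\sname\s*=\s*(["']{0,1})"""
    r"""(CreditCard|CardNumber|NewPassword|SecurityAnswer1)"""
    r"""\1\s+[^>]*?value\s*=\s*(["']{0,1}).*?\3[\/\s>]"""
)
BLOCKING_MASK = re.compile(r"""value\s*=\s*["']{0,1}([^"']*)["']{0,1}[\/\s>]""")


def block_values(html, strike="X"):
    """Strike out the value of every input tag matched by START_PATTERN."""

    def strike_out(outer):
        fragment = outer.group(0)
        inner = BLOCKING_MASK.search(fragment)
        if inner is None:
            return fragment  # nothing to strike, leave the text alone
        start, end = inner.span(1)
        return fragment[:start] + strike * (end - start) + fragment[end:]

    return START_PATTERN.sub(strike_out, html)


before = (
    '<input name="Email" value="someone@example.com">'
    '<input type="text" name="CreditCard" value="4111111111111111" />'
)
print(block_values(before))
```

In this simulation the Email field passes through untouched, and the sixteen digits of the CreditCard value come out as sixteen X characters.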

The second action for blocking a value attribute

Earlier we discussed adding two actions to our rules, both A_BlockCCInResponse1 and  A_BlockCCInResponse2. Why a second action? Because we don’t know if the developer (or framework) will put the name attribute first in the input tag, or the value attribute. So we need a variation on our StartPatternRE. The second action is identical to our first action, with the only difference being in the StartPatternRE.

StartPatternRE=(?-s)<[^>]*?\svalue\s*=\s*(["']{0,1}).*?\1\s+[^>]*?name\s*=\s*(["']{0,1})(CreditCard|CardNumber|NewPassword|SecurityAnswer1)\2[\/\s>]

In our second action, everything else, even the BlockingMask, does not get changed.
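A short Python check shows why both orderings need their own pattern. Again, Python’s re is only a stand-in for the PCA engine, and the (?-s) prefix is left out because Python does not accept it in that form. Note the group numbering in the value-first pattern: the value’s quote is group 1, the name’s quote is group 2, and the field-name alternation is group 3, so the closing quote after the field name is written as \2 in this sketch.

```python
import re

QUOTE = r"""(["']{0,1})"""
FIELDS = r"(CreditCard|CardNumber|NewPassword|SecurityAnswer1)"

# name attribute first, value attribute second
NAME_FIRST = re.compile(
    r"<[^>]*?\sname\s*=\s*" + QUOTE + FIELDS
    + r"\1\s+[^>]*?value\s*=\s*" + QUOTE + r".*?\3[\/\s>]"
)
# value attribute first, name attribute second
VALUE_FIRST = re.compile(
    r"<[^>]*?\svalue\s*=\s*" + QUOTE + r".*?\1\s+[^>]*?name\s*=\s*"
    + QUOTE + FIELDS + r"\2[\/\s>]"
)

name_then_value = '<input name="CreditCard" value="4111111111111111">'
value_then_name = '<input value="4111111111111111" name="CreditCard">'

for tag in (name_then_value, value_then_name):
    print(
        "name-first:", bool(NAME_FIRST.search(tag)),
        "value-first:", bool(VALUE_FIRST.search(tag)),
    )
```

Each pattern matches only its own ordering, which is why the rule carries both actions.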

Blocking Data in the Response block that is not a value attribute

The section above discussed how to block a value attribute in an input tag. Occasionally you might find PCI data echoed in the body of the response. Examples I’ve seen are sites that echo back both a bank routing number and an account number, sites that display stored credit card numbers for editing, and similar examples. When you have a site that does this, you need to find an example, cut the HTML code around the PCI data, paste it into a text editor, then modify the real PCI data to be something ‘fake’. Then use the online testing tools to develop a RegEx that will block the PCI portion of the data.

Blocking Data in the CUI/SDK library

The CUI/SDK library is JavaScript that runs in the user’s browser, records DOM events on the page (it only sees your web application, nothing else), and sends information about each DOM event back to the Tealeaf system. Most customers want to record the keystrokes the user enters, so they can tell if the page design causes problems in data entry, and correct it if so. So, we need to make sure that PCI data is NOT recorded and sent back to tealeaf. Yes, we could write blocking rules to block the data when it is received, but it is much easier to properly configure the CUI/SDK library to exclude certain fields. In the CUI/SDK, we can use wildcards to match the field names.

As of Version 8.7, the file you need to look at is the tealeaf.js file. The location may change in the future, so you should search for the string tlFieldBlock. You will find a nested structure by this name, and one of the structure’s members will be "name": followed by a pipe-delimited (|) string. The string will have the names of the input fields to block. The list of names should match the list you have in the request and response blocking rules. An example is

tlFieldBlock: [

{"name": "CreditCard|CardNumber|NewPassword|SecurityAnswer1", "caseinsensitive": true, "exclude": false, "mask": function () { return TeaLeaf.Client.PreserveMask.apply(this, arguments); } }

],

One of the very nice things about the CUI/SDK blocking is that the list of alternative field names is a true regular expression. password will match password, oldpassword, newpassword, passwordchanged, etc. You can use the ^ at the start of an alternative, or $ at the end of an alternative, to anchor the alternative string to the start or end of the field name (go look at the RegEx help online at the testing tools if you don’t know what start and end anchors are all about).
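The matching behavior described above can be tried out with a few lines of Python. This is only an analogy for what the JavaScript library does: it assumes an unanchored, case-insensitive search of the field name against the pipe-delimited string, as the description and the "caseinsensitive": true member suggest. The block list and the field names below are made up for the illustration.

```python
import re

# Hypothetical block list in the same pipe-delimited form as the "name" member.
BLOCK_NAMES = "password|creditcard|^cvv$"
BLOCK_RE = re.compile(BLOCK_NAMES, re.IGNORECASE)  # mirrors "caseinsensitive": true


def is_blocked(field_name):
    """True if any alternative is found anywhere in the field name."""
    return BLOCK_RE.search(field_name) is not None


for name in ["oldPassword", "passwordchanged", "txtCreditCardNumber",
             "CVV", "cvvHelpText", "email"]:
    print(f"{name:22} {'blocked' if is_blocked(name) else 'recorded'}")
```

The ^cvv$ alternative is anchored at both ends, so it blocks a field named cvv but leaves cvvHelpText alone, while the unanchored password alternative catches every field name that contains it.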

The best way I’ve ever seen this managed was at a customer that had a fellow whose job included making fixes to the web site. A really sharp guy, he was not a member of the development group but of operations, and his job was to make fixes to the web site when something was broken, and send those fixes back to the development team for incorporation into subsequent releases. This person was given the responsibility to add the tealeaf CUI/SDK libraries to the web pages. As is usually the case, it was done within a page template (and in fact, was incorporated as part of a tag/library management solution). But since he could make changes to the web pages, he made sure all PCI input fields had clear names which included password or creditcard or cvv or similar. Then it was very easy for him to make sure the CUI/SDK blocking rule names matched these input field names. I hope that your company makes a similar decision, and lets the CUI/SDK implementer change input field names when necessary. It will certainly reduce maintenance costs!

Blocking data in XML Web Services and “one page” designs

Blocking PCI data in an XML web service or in a JSON update to a page’s DOM is no different from blocking data in a classic web application. We are still dealing with Request and Response blocks that make up a hit. During development of the PCI privacy rules, find examples of tealeaf sessions having hits from the web service or one-page JSON DOM update, inspect the request and response blocks with the Replay tool (RTV or Replay server), then write and test RegExs to block the data. If the XML service is answering requests on just a very small number of URLs, you should consider making the Rule look at just those specific URLs, to keep the rule efficient. No use inspecting other pages for patterns that will never exist, nyeh?

Encrypting data instead of blocking it

Instead of blocking PCI data, the data can be encrypted so that only authorized users are able to see the encrypted data in clear-text. Securing the access to the clear-text is accomplished by using specific Active Directory security groups. Details on how to accomplish this are provided in the IBM Tealeaf CX Configuration Manual, in the sub-section Encrypting Data Filter under the section Privacy Session Agent. Please refer to this document for details. A summary of the steps is as follows:

  • Create an AD security group and populate it with authorized users
  • Use TMS to configure the Search Server Authentication
    • Add the AD security group to the Search Server Authentication dialog
    • Create a privacy key and assign it to the AD Security Group. Copy the key to the clipboard
  • Edit the privacy.cfg file
    • Add the privacy key to the bottom of the file in the [Keys] section with a new identifier, e.g. [Key03]
    • In the privacy blocking action for the fields you want to encrypt, change the Action from Block to Encrypt
    • Add the line Key=Key03 to the Action block

With this, an authorized user will see the clear-text value of a field during replay. Field values are stored in the Canister using the encrypted value, and indexed using the encrypted value, so searching for a clear-text value will not work.

Maintaining the blocking rules

One of the unfortunate facts of life in a tealeaf administrator’s job is that the web site will change over time. New PCI fields will appear in the application. The Tealeaf privacy rules will need to be maintained and updated to account for these changes. The hardest part of maintenance is getting notified that PCI-impacting changes are part of a new release. Here’s a partial list of some methods I’ve seen used to keep up to speed on this:

  • Attend new feature design and release meetings. Try to make sure somebody from the tealeaf admin team attends meetings where new features are being discussed, and they keep an eye (or ear) out for changes that discuss new credit card features or new password features. If a new PCI impacting feature is discussed, make sure the tealeaf team sees QA/Staging versions of the application, and implements new PCI rules BEFORE the application is released to production.
  • Set up a tealeaf event that looks for 15+ consecutive digits. This event will trigger if a credit card number comes through in clear-text. Of course, depending on the web application, it may also trigger on any other string of 15+ digits, but tealeaf admins for the site should become familiar with the ‘normal’ volume, and keeping an eye on this event may catch new places where CCNums appear.
  • Regularly (weekly or bi-weekly) do a full-text search of the Canister data for the strings password, passwd, passwrd, and even pass?*. Review the matches to see if new password fields have appeared.
  • Publish a regular (monthly) list of PCI-impacting fields you are blocking, and e-mail these to development teams or their management. Reminding these teams on a regular basis exactly what is PCI-impacting can help keep developers, new and experienced both, cognizant of the fact that they need to communicate PCI-impacting changes to the tealeaf team.
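The 15+ consecutive digits check from the list above is simple to prototype outside of tealeaf. The sketch below is plain Python, not a tealeaf event definition. The Luhn checksum filter is my own addition and is not part of the method described above; it is a standard card-number check that can cut down false hits from order numbers and timestamps. The function names and the sample text are made up.

```python
import re

DIGIT_RUN = re.compile(r"\d{15,}")  # 15 or more consecutive digits


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like(text):
    """Return digit runs of 15+ digits that also pass the Luhn check."""
    return [m.group(0) for m in DIGIT_RUN.finditer(text) if luhn_ok(m.group(0))]


sample = "order 1234567890123456 paid with card 4111111111111111"
print(DIGIT_RUN.findall(sample))  # both runs trigger the plain digit test
print(find_card_like(sample))     # only the well-known test card number remains
```

Run against response text pulled from a few sessions, a script like this can help establish the ‘normal’ volume before the event goes live.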

Conclusion

Blocking or encrypting PCI data to prevent it from being available to unauthorized tealeaf users is one of the most important tasks of the tealeaf admin staff. I hope you find this blog post useful for blocking the confidential information in your tealeaf capture. Please post feedback, errata or further questions in the comments. Happy TeaLeaf-ing!

Saturday, March 22, 2014

Tealeaf Ecosystem High-Level Block Diagram and Description

The following is a very high level block diagram and short description of ALL the components that make up a full tealeaf system, including the optional components. I’ve written this post as the way I implement systems, and other configurations are possible. For example, PCI blocking can be done at the HBR level or the Processing/Canister level, but doing so means more servers must be made PCI compliant, costing more money and carrying more risk. In another example, I’ve heard that few other implementation consultants bother to implement the Statistics database, but the effort is minimal and the data valuable, so I’ve never understood why not. Since this is a high-level description of the tealeaf components, I’ve chosen to describe them in declarative style without trying to enumerate all the possible options for configuring each component.
[Figure: ATAP Tealeaf Architecture High-Level Block Diagram]

[Edit 06/05/2014: A PDF version of this drawing can be found here.]

Below are descriptions of each block.

Network Components

This block encompasses all of the non-tealeaf servers that are involved with the duplication and transmission of the data packets.

Packet Duplication

Any network component (Tap, Switch Span, Load Balancer) that performs the actual duplication of TCP/IP packets. SSL decryption may be performed at this layer, or lower at the Passive Capture Appliances (PCAs) layer.

Packet Transmission

The network component (direct-connect crossover cable, switch, Gigamon) that connects the duplicated packet stream to the Passive Capture Appliances (PCAs). These devices may also connect the duplicated packet stream to other devices, such as intrusion-detection systems that need the data.

Passive Capture Appliances (PCAs)

Redundant Linux servers connected to the duplicated packet stream that reassemble the packets into request blocks and response blocks, and these into hits.  SSL decryption may be performed here, or higher at the Packet Duplication layer. PCI data blocking and/or masking should all be performed at this layer. The TLTSID sessionization value will be inserted here if not already present in a HTTP cookie from the request or response block. PCA servers should be treated as PCI critical components. Below this layer, no PCI data is present. PCA servers are managed via SSH and/or a tealeaf web console (GUI) that may only be accessed by tealeaf admins from the Health-Based Routers (HBRs).

Health-Based Routers (HBRs)

Redundant Windows servers connected to the PCAs whose primary purpose is to distribute the traffic stream to multiple Processing/Canister servers. Distribution is session-sticky (using the TLTSID cookie), and normally done with a statistical even-distribution algorithm to send roughly the same number of sessions to each Processing server.

Overall, this diagram is for a production tealeaf system, but a good corporate tealeaf implementation includes both development and QA tealeaf systems as well (much smaller of course). Developers of tealeaf events need a small stream of production data sent to the development tealeaf system so they have data against which to create new events. The HBR servers include the capability to extract specific sessions or a statistical random fraction of sessions from the production stream and send those sessions to the development tealeaf system, as indicated in the diagram.

The HBR servers monitor all of the Processing/Canister servers, and if any Processing server stops responding, the HBR servers take that Processing server out of rotation and redistribute the traffic to other servers. If a Processing server comes back alive, the HBR servers begin sending traffic back to that Processing server. Processing servers are cycled and self-checked every night, at different times, and the HBR routers must take each Processing server out of rotation, allow sufficient time for most sessions on that server to end, recognize the Processing server has stopped responding (while it self-checks), recognize when it comes back on-line, and begin sending it data again.

HBR servers do no data storage, but operate on the data stream. Examples include robot identification (User-Agent or IP based); deletion of hits based on IP address or URL or any combination of request or response patterns; rewriting the Remote IP address using the latest HTTP_X_FORWARDED_FOR value; copying cookie values like a SID to the appdata section for indexing; condensing the referrer domain to provide meaningful referrers; extracting price and currency information from a page; normalizing page URLs to remove locale country codes; and many other operations.
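To illustrate what session-sticky, roughly even distribution means, here is a toy Python sketch. It is not the HBR’s actual algorithm, which this post does not describe; it simply hashes the TLTSID value and uses the result to pick one of the servers currently in rotation. The function name pick_server and the server names are hypothetical.

```python
import hashlib


def pick_server(tltsid, healthy_servers):
    """Pick a Processing server for a session id (illustrative only).

    The same TLTSID always maps to the same server while the set of
    healthy servers is unchanged, and sessions spread roughly evenly.
    """
    if not healthy_servers:
        raise ValueError("no Processing servers in rotation")
    servers = sorted(healthy_servers)  # make the choice independent of list order
    digest = hashlib.sha256(tltsid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["proc01", "proc02", "proc03"]
first = pick_server("A1B2C3D4E5F6", servers)
again = pick_server("A1B2C3D4E5F6", servers)
print(first == again)  # True: every hit of the session lands on one server

# A server that stops responding is simply left out of the list.
print(pick_server("A1B2C3D4E5F6", ["proc01", "proc03"]))
```

A simple modulo scheme like this reshuffles many sessions when a server leaves rotation, which is one reason a real router needs the more careful drain-and-return handling described above.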

Processing/Canister servers

Redundant Windows servers connected to the HBRs that provide the storage location for hits and process these hits looking for patterns. The duration that sessions persist for replay and analysis is a function of traffic density and the amount of disk space the Canister servers provide. Sessions are extracted from the canisters on demand to replay in the Replay Server or RealiTea viewer.

Processing of hits is shorthand verbiage for the complex pattern recognition performed by the tealeaf Events system. Pattern recognition is done against individual hits, against session metadata, and by combinatorial logic looking at all events that have occurred in the entire session. In addition to looking for patterns, match groups can be defined to extract substrings from the data, and the number of times each substring occurs can be recorded. Metadata such as the “time-into” a session that a pattern occurs can be extracted. Extracted data can be grouped into sets. Event processing and data extraction performed in the Processing/Canister servers can be very complex.

If the optional cxConnect Real-time data extractor is installed, the Processing servers create messages for each event configured for real-time extraction, along with information on each hit the event occurs on, and these messages are delivered via a tealeaf pipeline to the cxConnect server.

Reporting Server

Non-redundant Windows server connected to the Processing/Canister servers that polls the Processing servers for their traffic and event counts, and provides the primary web-based GUI interface for users to see reports on this data. Traffic and Event data is collected from the processing servers, and stored in the Tealeaf SQL Reporting database. Dimension aggregates are calculated and stored in the database during each collection run. Daily, Weekly and Monthly aggregates are also calculated periodically. The Reporting GUI provides an interface to create and view reports on this data. Every tealeaf PCA, HBR, and Processing server reports its raw traffic statistics to the Reporting server, which stores this information in the Tealeaf SQL Statistical database, and the information is available in the Reporting GUI.

Tealeaf SQL Database server

Non-redundant Windows server running Microsoft SQL Server, or a corporate server running Microsoft SQL Server. Tealeaf creates and uses three SQL databases (System, Reporting, and Statistics). The SQL Server is usually managed by the corporate DBA team, and not the tealeaf admins. The tealeaf admin team works closely with the DBAs to install, update, maintain, and back up these databases.

Replay Server

Usually a service running on the Reporting server, the Replay server may also be configured as a stand-alone Windows server. The Replay server is used to create a visual replay of user sessions from the data stream of hits (request/response pairs) stored in the Processing/Canister servers. The replay is web-based, and does not require any executable to be installed on a tealeaf user’s computer. See also the RealiTea Viewer section for an alternative method of replay.

cxOverstat

An optional feature that, when turned on, allows the Replay server to display heatmaps and the other usability overlays that cxOverstat provides.

Archive Servers

Optional Windows server(s) whose purpose is to extract a subset of sessions from all Canister servers that meet a selection criterion (most often, a purchase session or a trade session), and store these sessions for a much longer period than the Canister servers do. For example, if the Canister servers are all storing sessions for 30 days, the Archive servers may store purchase sessions for two years. These can also be configured as “non-tamperable”, which provides a hash-based mechanism to prove that a session replayed from the Archive server is the same session that was originally captured from the web site.
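The general idea behind a hash-based tamper check can be sketched in a few lines of Python. This is a generic illustration only; I am not describing the actual algorithm or storage format the Archive server uses, and the function names session_digest and is_untampered are made up for this example.

```python
import hashlib
import hmac


def session_digest(hits):
    """SHA-256 over every hit in order; each hit is length-prefixed so that
    moving bytes from one hit to the next also changes the digest."""
    h = hashlib.sha256()
    for hit in hits:
        h.update(len(hit).to_bytes(8, "big"))
        h.update(hit)
    return h.hexdigest()


def is_untampered(hits, stored_digest):
    """Recompute the digest at replay time and compare it to the stored one."""
    return hmac.compare_digest(session_digest(hits), stored_digest)


# Digest recorded when the session is archived (hypothetical hit contents).
original = [b"GET /cart", b"POST /checkout"]
digest_at_capture = session_digest(original)
```

If even one byte of any hit changes after capture, the recomputed digest no longer matches the stored one, which is what lets you show that the replayed session is the one originally recorded.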

TLI servers

Optional Windows server(s) that store and make available for replay certain static content of the web site. Without a TLI server, the images and JS files are loaded from the live web site during replay. With a TLI server, these files are stored each day, so during replay of a session from, say, three weeks ago, the images and JS files as they were three weeks ago are used in the replay. This gives a higher-fidelity replay that is closer to the actual historical session. The drawback is the amount of storage needed to keep historical static content.
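The "content as it was on the session date" lookup can be pictured with a small Python sketch. This is only a model of the concept, assuming one stored snapshot per URL per day; the StaticContentArchive class and its methods are hypothetical and say nothing about how a TLI server actually stores its files.

```python
import bisect
from datetime import date


class StaticContentArchive:
    """Keeps dated snapshots of static files, keyed by URL."""

    def __init__(self):
        self._versions = {}  # url -> (sorted list of dates, matching list of bodies)

    def store(self, url, day, body):
        days, bodies = self._versions.setdefault(url, ([], []))
        i = bisect.bisect_left(days, day)
        if i < len(days) and days[i] == day:
            bodies[i] = body  # replace the snapshot already stored for that day
        else:
            days.insert(i, day)
            bodies.insert(i, body)

    def as_of(self, url, day):
        """Return the newest snapshot taken on or before the given day."""
        days, bodies = self._versions.get(url, ([], []))
        i = bisect.bisect_right(days, day)
        return bodies[i - 1] if i else None


archive = StaticContentArchive()
archive.store("/img/logo.png", date(2014, 3, 1), b"old logo")
archive.store("/img/logo.png", date(2014, 3, 20), b"new logo")
```

Replaying a session from March 10 would then use the March 1 snapshot rather than whatever the site serves today.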

cxConnect Servers

Optional Windows server(s) whose purpose is to provide the interface that extracts data from the tealeaf Processing/Canister servers and makes that data available to external systems. There are two separate data feeds available. The real-time data feed is a set of (configurable) selected event messages with parameters that are sent by the Processing/Canister servers to the cxConnect servers using a tealeaf transport pipeline. The cxConnect servers have two distribution choices for this data: log to file and/or send to a TCP/IP listener on an external system. The real-time feed is most often used to feed a Complex Event Processing (CEP) listener system. In turn, these systems drive real-time decisioning systems that modify the web application’s responses to the user based on their past actions in the session. The other data feed available from the cxConnect systems is a scheduled (typically hourly or daily) batch extract of detailed information regarding users, sessions, hits, parameters, events, and dimensions. The information is stored in flat files that conform to the Microsoft SQL Bulk Load (BCP) format. Tealeaf provides an example schema for creating a relational SQL database, and script jobs for loading the corresponding BCP files into these tables. This is the reference mechanism provided by tealeaf for putting the data into a fully relational database.
The following three pieces of tealeaf code are implemented on the web sites and native mobile applications

Cookie Injector

Very small piece of code running on the web servers that adds the three tealeaf cookies. The TLTSID is a non-persistent cookie whose value does not change as long as the browser window remains open. The TLTHID cookie is a unique identifier assigned to each hit. The TLTUID is a persistent cookie left on the browser, whose value is sent each time the user revisits the site. All three cookies are 32-character GUIDs.
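The behavior of the three cookies can be sketched as follows. This is a minimal Python illustration and not the Cookie Injector's code; the tealeaf_cookies function, the use of uuid4, and the one-year lifetime for TLTUID are my assumptions for the example.

```python
import uuid
from http.cookies import SimpleCookie

ONE_YEAR = 365 * 24 * 60 * 60  # assumed lifetime of the persistent cookie, in seconds


def new_id():
    """A random GUID rendered as 32 hex characters."""
    return uuid.uuid4().hex.upper()


def tealeaf_cookies(request_cookies):
    """Build the cookies for one response, given the cookies the browser sent."""
    out = SimpleCookie()
    # Session cookie: reuse the browser's value; no expiry, so it ends with the browser window.
    out["TLTSID"] = request_cookies.get("TLTSID") or new_id()
    # Hit cookie: a fresh identifier on every hit.
    out["TLTHID"] = new_id()
    # User cookie: reuse the browser's value and keep it across visits.
    out["TLTUID"] = request_cookies.get("TLTUID") or new_id()
    out["TLTUID"]["max-age"] = ONE_YEAR
    return out


first = tealeaf_cookies({})
second = tealeaf_cookies(
    {"TLTSID": first["TLTSID"].value, "TLTUID": first["TLTUID"].value}
)
```

Across the two hits, TLTSID and TLTUID keep their values while TLTHID changes, which is what lets tealeaf group hits into sessions and sessions into returning visitors.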

Client-Side Recording (CUI/SDK) Library

JavaScript (JS) library added to the web site and called from the web pages. This library implements the recording of the DOM events on the web page, and transmits the page’s DOM event information (mouse actions, keyboard actions, page rendering time, etc.) back to the tealeaf system. Includes a target page to be added to the web site that the library will call back to. The CUI/SDK is a very important piece to implement properly for sites that use anything similar to the “one-page” technique. The implementation of the CUI/SDK provides a much higher fidelity replay of the session.

Mobile Library

JavaScript (JS) library added to the mobile web site and to the mobile device native application (iOS or Android) called by the device. This library implements the recording of the user’s interactions with the device and the page. Screen click, key-press, swipe, pinch, rotation and other page interactions are recorded and transmitted back to the tealeaf system. The implementation of the CUI/SDK provides a much higher fidelity replay of the session.
The following executable is installed onto the tealeaf user’s computer for replay of sessions

RealiTea Viewer

This is an executable program that can be installed onto a tealeaf user’s computer to allow for replay of user sessions. In addition to just replay, it allows multiple sessions to be downloaded to the user’s computer, and provides searching and analysis capabilities for patterns across these downloaded sessions. It includes the ability to customize the view panes and data fields displayed for both hits and for session metadata.
The following systems are not part of the tealeaf ecosystem, but if they exist in the company, they may be fed with data from the optional cxConnect servers.

SQL Relational Database Server for tealeaf events

Non-redundant Windows server running Microsoft SQL Server, or a corporate server running Microsoft SQL Server. Tealeaf provides a reference schema for a relational database that links together sessions, hits, query parameters, events, and dimensions. The cxConnect data extractor populates these tables. These tables provide a very rich source for analytics against user behavior.

Real-Time decisioning systems

A system that modifies the web page contents based on the user’s past actions. Usually some kind of Complex Event Processing system tied into the web servers.
The following software constructs are resident in the corporate Active Directory structures.

Active Directory Security groups for tealeaf

At a minimum there are two AD security groups. One enumerates the userids which are allowed to use the tealeaf Reporting GUI and allowed to access the session data stored in the Processing/Canister servers for replay. The other enumerates the userids given access to tealeaf at an administrative level. These may be Global groups or Domain-local, meaning that userids from multiple forests are supported if desired. Additional AD security groups may be created for teams such as fraud investigation teams, which are given access to encrypted PCI data, should the system be configured to encrypt certain fields instead of blocking them. Only members of these specific AD groups will be able to see the PCI data in clear text. Normal tealeaf users and tealeaf administrators see encrypted gibberish. PCI fields that are blocked instead of encrypted are always replaced with ‘X’ for all users.
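The visibility rules above can be summarized in a short Python sketch. The group names, the StoredField class, and the toy_decrypt function are all hypothetical, and toy_decrypt is only a stand-in that reverses a string so the example runs; it is not real encryption and not how tealeaf decrypts data.

```python
from dataclasses import dataclass

# Hypothetical AD group names, for illustration only.
USERS_GROUP = "TL-Users"
ADMINS_GROUP = "TL-Admins"
PCI_GROUP = "TL-FraudInvestigators"


@dataclass
class StoredField:
    stored: str      # what is kept in the session data
    encrypted: bool  # True if masked (encrypted), False if blocked with X's


def toy_decrypt(text):
    """Placeholder for a real decryption step: simply reverses the string."""
    return text[::-1]


def visible_value(field, user_groups):
    """Return what a user in the given AD groups sees for this field."""
    if field.encrypted and PCI_GROUP in user_groups:
        return toy_decrypt(field.stored)
    # Blocked fields are already X's; encrypted fields stay unreadable.
    return field.stored


card = StoredField(stored="1111111111111114", encrypted=True)
password = StoredField(stored="XXXXXXXX", encrypted=False)
```

Note that being a tealeaf administrator is not enough: in this sketch, as in the description above, only membership in the specific PCI group reveals the clear text, and a blocked field stays as X's for everyone.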

Conclusion

I hope this overview is useful in understanding, at a very high level, all the components of the tealeaf ecosystem. As with all large complex computer systems, the features and components are evolving, so this post will eventually become obsolete. However, as of tealeaf version 8.8 in the spring of 2014, this should be a pretty complete picture.
Feedback, comments, and questions are always welcome. Happy Tealeaf-ing!