For a project I’m working on, I need to deal with historical daily TAQ (Trade and Quote) data for a selection of stocks. Each day’s TAQ file consists of all of the stock transactions that occurred during that trading day. The data were downloaded from the Wharton Research Data Services website, using the Yale Center for Analytical Sciences subscription.

To demonstrate, here are the first few rows of a TAQ file:

SYMBOL,DATE,TIME,PRICE,SIZE,CORR,COND,EX
A,20100104,9:30:02,31.32,98,0,Q,T
A,20100104,9:30:50,31.39,100,0,F,T
A,20100104,9:30:50,31.4,300,0,F,T
A,20100104,9:30:50,31.41,100,0,Q,P


“SYMBOL” refers to the stock’s ticker symbol. “DATE” is a string comprising the year, month, and day of the transaction; it is the same in every row of a given TAQ file. “TIME” is the hour, minute, and second of the transaction, in Eastern Standard Time. “PRICE” is the amount of money exchanged per share of stock in the transaction. “SIZE” is the number of shares traded in the transaction. The other columns are unimportant for my analysis.

## Splitting up Big Files

Each raw TAQ data file is too big to open all at once using, for example, read.csv. Attempting to open one on the remote machine I’m using (with 8 GB of RAM) results in memory swapping, slowing the processing to a practical standstill.

Instead, I processed the TAQ files by reading in one line at a time, which uses a negligible amount of memory. I wrote a Python script TAQprocess.py which is shown below. The code is object-oriented, which slows it down somewhat, but in this case, making the code easier to read and manage seemed worth the added computation time.

The following R code was used to run the Python script on each date of interest. It processes the files in parallel, making use of all twelve cores on my machine.

Now we have one folder for each date of interest. Within that folder, there is a file for each stock that was represented on that day’s TAQ file. This stock’s file has a row for each second of the day during which trades occurred. Each row consists of two columns: time of day (in seconds) and weighted average trade price during that second. For example,

time,price
34201,52.41
34205,52.4
34210,52.41
34222,52.4


## Market Capitalization Data

Next, I made a list of all of the stocks that are represented on every date of interest and stored it as a file called goodstocks.txt.

For this list of stocks, I ran a script to scrape the web and determine the number of shares outstanding for each stock during the dates of interest. Note that the outstanding and append.csv functions are detailed in a previous post.

This creates a file outstanding.csv with the number of shares outstanding on each date for each stock. The first few lines of the file are shown below.

,20100517,20100610,20110804,20111013
A,347930000,347930000,347930000,347930000
AA,1.07e+09,1.07e+09,1.07e+09,1.07e+09
AACC,30770000,30770000,30770000,30770000
AAON,24520000,24520000,24520000,24520000


For many stocks, I was unable to retrieve Yahoo! Finance or GetSplitHistory data, so they were thrown out. Also, any stocks that had a split during either of the periods of interest were discarded from my list of good stocks.

Finally, multiplying the number of shares outstanding by the share price tells us the market capitalization of each stock for each period. Of course, each stock does not have a single price for each period. Instead, I compute the average price.

Here is a glance at the spread of market caps.

## Cleaning the Data

On some holidays the stock market is only open for a shortened trading day. If any partial trading days are present in my data set I should throw them out. I would guess that any partial trading days should have noticably less trading activity than ordinary trading days. I made a boxplot of trading activity to look for lower outliers, but there were none. As a result, I assume there are no partial trading days in my data set.

Ultimately, we want to compare SSCB stocks to non-SSCB stocks, in hopes of determining differences caused by the SSCB rules. To that end, we should try to control for trading activity. In other words, we want amount of trading activity to be about the same within the two groups. The SSCB stocks are, on average, more frequently traded. Therefore, I expect to toss out many of the obscure and infrequently traded non-SSCB stocks in order to make the two groups more similar.

I decided to discard any stock that had a day of fewer than 400 seconds of trading. The non-SSCB group still skews lower, but at least they are in the same ballpark now.

How many stocks are left for the analysis?

## Estimating Volatility Profiles

Now that the stock data has been simplified, cleaned, and organized into manageable chunks, we can estimate volatility profiles as described in a prior post.

Here’s a typical example of a squared volatility profile.