For a project I’m working on, I need to deal with historical daily TAQ (Trade and Quote) data for a selection of stocks. Each day’s TAQ file consists of all of the stock transactions that occurred during that trading day. The data were downloaded from the Wharton Research Data Services website, using the Yale Center for Analytical Sciences subscription.
To demonstrate, here are the first few rows of a TAQ file:
“SYMBOL” refers to the stock’s ticker symbol. “DATE” is a string comprising the year, month, and day of the transaction; it is the same in every row of a given TAQ file. “TIME” is the hour, minute, and second of the transaction, in Eastern Standard Time. “PRICE” is the amount of money exchanged per share of stock in the transaction. “SIZE” is the number of shares traded in the transaction. The other columns are unimportant for my analysis.
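Since the per-second summaries described below are keyed by seconds after midnight, it is handy to convert the TIME strings up front. Here is a minimal sketch, assuming a colon-separated HH:MM:SS format (the exact format in the raw files may differ):

```python
def time_to_seconds(time_str):
    """Convert a TAQ TIME string like '9:30:01' into seconds after
    midnight. The colon-separated format is an assumption based on
    the column description above."""
    h, m, s = (int(x) for x in time_str.split(":"))
    return 3600 * h + 60 * m + s
```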
Splitting up Big Files
Each raw TAQ data file is too big to open all at once using, for example, read.csv. Attempting to open one on the remote machine I’m using (with 8 GB of RAM) results in memory swapping, slowing the processing to a practical standstill.
Instead, I processed the TAQ files by reading in one line at a time, which uses a negligible amount of memory. I wrote a Python script, TAQprocess.py, which is shown below. The code is object-oriented, which slows it down somewhat, but making it easier to read and manage seemed worth the added computation time.
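In a nutshell, the line-at-a-time approach looks like the sketch below (this is not the actual TAQprocess.py; the column names and CSV layout are assumptions):

```python
import csv

def read_trades(path):
    """Stream a TAQ file one row at a time, so memory use stays flat
    no matter how large the file is."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            h, m, s = (int(x) for x in row["TIME"].split(":"))
            yield (row["SYMBOL"],
                   3600 * h + 60 * m + s,   # seconds after midnight
                   float(row["PRICE"]),
                   int(row["SIZE"]))
```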
The following R code was used to run the Python script on each date of interest. It processes the files in parallel, making use of all twelve cores on my machine.
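(For illustration, an equivalent fan-out could also be written in Python with multiprocessing; the date list and the script's command-line interface here are placeholders, not the actual setup.)

```python
import subprocess
from multiprocessing import Pool

def run_one(date):
    # The calling convention (one date argument) is an assumption.
    subprocess.run(["python", "TAQprocess.py", date], check=True)

if __name__ == "__main__":
    dates = ["20100510", "20100511"]  # placeholder dates of interest
    with Pool(12) as pool:            # one worker per core
        pool.map(run_one, dates)
```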
Now we have one folder for each date of interest. Within each folder, there is a file for each stock that appeared in that day’s TAQ file. Each stock’s file has a row for each second of the day during which trades occurred, and each row consists of two columns: the time of day (in seconds) and the weighted average trade price during that second. An example is shown below.
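Assuming “weighted average” means weighting each trade’s price by its size, the per-second collapse can be sketched as:

```python
from collections import defaultdict

def second_prices(trades):
    """Collapse one stock's (second, price, size) trades into
    {second: size-weighted average price}. A sketch of the weighting,
    not the script's actual implementation."""
    dollars = defaultdict(float)  # sum of price * size per second
    shares = defaultdict(int)     # sum of size per second
    for sec, price, size in trades:
        dollars[sec] += price * size
        shares[sec] += size
    return {sec: dollars[sec] / shares[sec] for sec in dollars}
```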
This creates a file outstanding.csv with the number of shares outstanding on each date for each stock. The first few lines of the file are shown below.
For many stocks, I was unable to retrieve Yahoo! Finance or GetSplitHistory data, so those stocks were thrown out. I also discarded any stock that had a split during either of the periods of interest.
Finally, multiplying the number of shares outstanding by the share price tells us the market capitalization of each stock for each period. Of course, a stock does not have a single price over a whole period, so I use the average price instead.
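The arithmetic is simple; here is a sketch, with a plain mean standing in for however the average is actually taken:

```python
def market_cap(shares_outstanding, prices):
    """Market cap for one stock over one period: shares outstanding
    times the period's average price. The plain mean is an assumption."""
    return shares_outstanding * (sum(prices) / len(prices))
```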
Here is a glance at the spread of market caps.
Cleaning the Data
On some holidays the stock market is only open for a shortened trading day. If any partial trading days are present in my data set, I should throw them out. I would guess that any partial trading days would have noticeably less trading activity than ordinary trading days, so I made a boxplot of trading activity to look for low outliers. There were none, so I assume there are no partial trading days in my data set.
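The outlier check behind that boxplot can be sketched as follows, assuming trading activity is measured as, say, total trades per day:

```python
def low_outliers(activity):
    """Return dates whose activity falls below the standard boxplot
    lower fence, Q1 - 1.5 * IQR. `activity` maps each date to some
    measure of activity, e.g. total trades that day (an assumption)."""
    values = sorted(activity.values())
    n = len(values)
    q1, q3 = values[n // 4], values[(3 * n) // 4]
    fence = q1 - 1.5 * (q3 - q1)
    return [date for date, v in activity.items() if v < fence]
```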
Ultimately, we want to compare SSCB (single-stock circuit breaker) stocks to non-SSCB stocks, in hopes of identifying differences caused by the SSCB rules. To that end, we should try to control for trading activity; in other words, we want the amount of trading activity to be about the same across the two groups. The SSCB stocks are, on average, more frequently traded, so I expect to toss out many of the obscure, infrequently traded non-SSCB stocks to make the two groups more similar.
I decided to discard any stock with a day of fewer than 400 seconds of trading. The non-SSCB group still skews lower, but at least the two groups are now in the same ballpark.
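The cutoff itself is a one-liner; a sketch, assuming we have each stock's per-day counts of seconds with trading:

```python
def liquid_stocks(seconds_per_day):
    """Keep stocks that traded during at least 400 distinct seconds
    on every day. `seconds_per_day` maps each stock to a list of
    daily counts (the data layout is an assumption)."""
    return [stock for stock, counts in seconds_per_day.items()
            if min(counts) >= 400]
```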