Fun with Oracle and the Dow Jones
By Ivan Pavlov [ivan dot pavlov at gmail]
PART 1
I started learning Oracle two days before I started wrote this article. So if you find that some of the things below are not done in the best way, you can leave a comment. Before starting you will need of course an Oracle server. You can get the free version -
Oracle XE. You might also want to install
Oracle SQL Developer. A good knowledge of general SQL is also necessary.
The question I would like to explore today is can we use Oracle to tell us when the financial crisis started? First, let's get the Dow Jones daily data at
http://ichart.finance.yahoo.com/table.csv?s=^DJI&a=09&b=1&c=20017&d=11&e=3&f=2008&g=d&ignore=.csv
and save them in a text file called dow.csv (stripping the header line first). They cover a period of a little bit more than an year. We will now proceed to load the data in an Oracle table. Lets first create an external directory (as system user) and grant some rights to it:
CREATE OR REPLACE DIRECTORY
xtern_data_dir AS '[/path/where/dow.csv/resides]';
GRANT READ, WRITE
ON DIRECTORY xtern_data_dir TO [user];
Then we need to create the external table based on the columns of the csv file.
CREATE TABLE XTERN_DOW (
"PDATE" VARCHAR2(32),
"POPEN" NUMBER(10,2) ,
"PHIGH" NUMBER(10,2) ,
"PLOW" NUMBER(10,2) ,
"PCLOSE" NUMBER(10,2) ,
"PVOL" NUMBER ,
"PADJ" NUMBER(10,2)
)
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER
DEFAULT DIRECTORY "XTERN_DATA_DIR"
ACCESS PARAMETERS ( records delimited BY newline fields terminated BY ',' )
LOCATION ( "XTERN_DATA_DIR":' dow.csv' )
);
At this stage no data has been loaded and no verification of the structure has been performed. However if we try to SELECT * FROM XTERN_DOW we will see that the data are read correctly. Now lets extract the data in a normal table:
CREATE TABLE DOW (
"PDATE" DATE,
"POPEN" NUMBER(10,2), --opening price
"PHIGH" NUMBER(10,2), --daily high
"PLOW" NUMBER(10,2), --daily low
"PCLOSE" NUMBER(10,2), --daily close
"PVOL" NUMBER, --daily volume
"PADJ" NUMBER(10,2), --daily adjusted price
CONSTRAINT "PK_DOW" PRIMARY KEY ("PDATE")
);
We will need to transform the PDATE field from VARCHAR2 to DATE type before loading the data.
INSERT INTO DOW
SELECT TO_DATE(PDATE,'YYYY-MM-DD'),
POPEN ,
PHIGH ,
PLOW ,
PCLOSE ,
PVOL ,
PADJ
FROM XTERN_DOW;
We will not need all of these data. Actually we will use only the closing price, but we could use the high and low prices for more in-depth analysis. You can drop the PADJ column for example (or
SET UNUSED). Now we are ready to proceed with the interesting part of this exercise. First we will determine the weekly minimum, maximum and average closing prices. We take averages for 7 days.
SELECT
i.pdate,
i.pclose,
MIN(i.pclose) OVER( ORDER BY i.pdate RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) minp,
MAX(i.pclose) OVER( ORDER BY i.pdate RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) maxp,
ROUND(AVG(i.pclose) OVER( ORDER BY i.pdate RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW),2) avgp
FROM dow i
ORDER BY i.pdate
We can actually make this a view, a
materialized view in fact. Just prepend to the previous statement the following
CREATE MATERIALIZED VIEW WEEKLY_DOW ("PDATE", "PCLOSE", "MINP", "MAXP", "AVGP")
REFRESH ON DEMAND
WITH PRIMARY KEY
AS
You might also want to create an index on this view. It will not speed up things now but later (in Part 2) it will be useful.
CREATE UNIQUE INDEX WEEKLY_DOW_PK ON WEEKLY_DOW (PDATE);
Looking at the data in the view we just created it is obvious that the value of the dow index is steadily declining, but a period of decline does not necessarily mean a full-blown crisis. We shall make the assumption that in order to have a real crisis besides the falling stock prices we need unnaturally large variability in the dow jones values. We define variability as (maxp-minp)/avgp. One thing you might want to change is the definition of minp and maxp. Instead of selecting minp:=MIN(pclose) you can select minp:=MIN(plow), and the same for maxp - maxp:=MAX(phigh). The results will not be substantially different.
Now we are ready to create the most interesting SQL statement today, one that will show us when the crisis really hit dow jones. Without further ado, here it is:
SELECT
TO_CHAR(pdate,'yyyy-mm-dd') "Fridays" ,
minp "Min close",
maxp "Max close",
avgp "Avg close",
ROUND((maxp-minp)/avgp,2) "Variance" ,
RANK() OVER (ORDER BY (maxp-minp)/avgp DESC) "Variance rank",
RANK() OVER (ORDER BY avgp) "Avg rank"
FROM WEEKLY_DOW
WHERE TO_CHAR(pdate,'d')=6
ORDER BY pdate DESC;
And here are data for October and November:
Fridays Min close Max close Avg close Variance Var rank Avg rank
---------- ----------- ----------- --------- ---------- -------- --------
2008-11-28 8046.42 8829.04 8504.99 0.09 7 2
2008-11-21 7552.29 8497.31 8131.94 0.12 3 1
2008-11-14 8282.66 8943.81 8687.26 0.08 9 3
2008-11-07 8695.79 9625.28 9176.82 0.1 5 7
2008-10-31 8175.77 9336.93 8854.74 0.13 2 5
2008-10-24 8378.95 9265.43 8792.05 0.1 6 4
2008-10-17 8451.19 9387.61 8926.53 0.1 4 6
2008-10-10 8451.19 10325.38 9336.08 0.2 1 8
2008-10-03 10325.38 11143.13 10666.42 0.08 8 9
Lets see what these data mean. Take for example the second row (2008-11-21), its "Avg rank" is equal to 1 which means that (refer to our query) during the week ending on this day the average value of Dow Jones has been at its absolute lowest. Similarly, the weekly variability of Dow Jones has been highest during the week ending on 2008-10-10. Examining the ranks we clearly see that those two months have bee months of record low values and record high variability. This is a sure sign of a rampaging financial crisis. For comparison lets run the same sql statement but only for dates older that 2008-06-01 (here, for brevity, we only list the rank()-columns):
Fridays Variance rank Avg rank
---------- --------------- ---------
2008-05-30 31 14
2008-05-23 7 16
2008-05-16 27 18
2008-05-09 22 19
2008-05-02 28 20
2008-04-25 30 17
2008-04-18 5 11
2008-04-11 23 13
2008-04-04 10 10
The ranks are completely random - no crisis then. Please note that in both cases your results may differ depending on the exact starting set of data you use. Meaning that if you have downloaded e.g. 16 months of data, and not exactly the same set as I have downloaded you figures will be different. Qualitatively, however, the results will be the same.
You can also play with other periods and see what the data show, or load a bigger dataset from yahoo and analyze the data. You can load even the full dataset - starting from year 1928 and try to automate the process of crisis detection. Be careful to only analyze short periods of time (i.e. 1 year) otherwise the results will not make much sense. This can be fun.
PART 2
How to (automatically) calculate all financial crisises from 1930 until 2008? First you should get the daily Dow Jones data from Yahoo, as far back in time as you can (which is 1928) and load it as we did above. Alternatively download the CSV file which I prepared (this one does not have the PADJ column, but you don't need it). Refresh the WEEKLY_DOW view because we defined it as "refresh on demand".
CALL DBMS_MVIEW.REFRESH('weekly_dow');
Then let's create a table called "DATES" as this:
CREATE TABLE "DATES" (
"PDATE" DATE NOT NULL ENABLE,
"VARRANK" NUMBER,
"AVRANK" NUMBER,
CONSTRAINT "DATES_PK" PRIMARY KEY ("PDATE")
);
Here AVRANK and VARRANK will hold the values of "Avg rank" and "Variance rank" columns from the example above. Then we populate the table with all Fridays starting from 3th of January 1930 until 28th of November 2008.
INSERT INTO dates(pdate)
SELECT dd
FROM dual
MODEL
DIMENSION BY(0 idx)
MEASURES(sysdate dd)
RULES ITERATE(5000) UNTIL (dd[iteration_number]<=TO_DATE('1930-01-03'))
(dd[iteration_number]=TO_DATE('2008-11-28')-7*iteration_number)
If you don't understand how this work read about the
MODEL clause. Now we have to populate the values of the columns VARRANK and AVRANK for all dates in our new table. To do this we must essentially repeat the procedure above but limiting our calculations for a period of time equal to approximately an year. There is no point rank()-ing weekly data over an 80 year period, is there? So in order to calculate the coefficients for, let's say, 2008-11-28 we must do the following:
SELECT ROUND(AVG(y.vr)), ROUND(AVG(y.ar))
FROM (
SELECT ar,vr FROM (
SELECT
RANK() OVER (ORDER BY (maxp-minp)/avgp DESC) vr,
RANK() OVER (ORDER BY avgp) ar
FROM WEEKLY_DOW
WHERE TO_CHAR(pdate,'d')='6'
AND pdate between TO_DATE('2007-11-28') and TO_DATE('2008-11-28')
ORDER BY pdate DESC
)x
WHERE ROWNUM<=8)y;
Please note that we average the coefficients for two months back (8 weeks) and we only work with Fridays (TO_CHAR(pdate,'d')='6' ). In this way we hope to offset random fluctuations as much as possible. We will need to repeat this for every single date in the DATES table. So its better to write a procedure, but first I will create a new TYPE.
CREATE OR REPLACE TYPE dattyp AS OBJECT ( varrank number , avrank number );
CREATE OR REPLACE TYPE dfunc AS TABLE OF dattyp;
This is not the smartest way to achieve what we plan to achieve but it serves well to demonstrate some features of Oracle. Now I will wrap our SQL in a function which will return dfunc, which we can later use directly to update the DATES table.
CREATE OR REPLACE FUNCTION dowvar(indate IN DATE)
RETURN dfunc AS dowstr dfunc;
BEGIN
SELECT dattyp(ROUND(AVG(y.vr)), ROUND(AVG(y.ar)))
BULK COLLECT INTO dowstr
FROM (
SELECT ar,vr FROM (
SELECT
RANK() OVER (ORDER BY (maxp-minp)/avgp DESC) vr,
RANK() OVER (ORDER BY avgp) ar
FROM WEEKLY_DOW
WHERE TO_CHAR(pdate,'d')='6'
AND pdate between indate-360 and indate
ORDER BY pdate DESC
)x
WHERE ROWNUM<=8)y;
RETURN dowstr;
END dowvar;
This function receives a single date as a parameter and returns a tuple representing in our case (varrank, avrank). Note that the innermost SELECT statement makes good use of the index we already defined on the PDATE column of the materialized view.
We are ready to proceed with the population of the table with a single UPDATE statement:
UPDATE DATES SET (varrank,avrank)=(SELECT varrank,avrank FROM TABLE(dowvar(pdate)));
This is very fast if you use a materialized view - with an index on PDATE it takes 3 sec, without an index on PDATE it takes - 30 sec, and if the view was not materialized it takes approx. 10 min. After it's finished we are good to go.
We will make the following assumption. AVRANK and VARRANK indicate a crisis if both are not bigger than 10. So we need an SQL statement to show us all time intervals where both variables have consistently stayed below 10. It is a difficult one to write, especially if we want the time intervals to be automatically coalesced, i.e. instead of having:
PDATE VARRANK AVRANK
------------------------- ---------------------- ------
08-11-28 5 5
08-11-21 5 5
08-11-14 5 5
08-11-07 5 5
08-10-31 8 5
08-10-24 10 5
That's why we will use analytical functions in Oracle.
SELECT istart||' to '||iend, (iend-istart)||' days' FROM (
SELECT
iend,
LEAD(istart) OVER (ORDER BY pdate DESC) istart
FROM (
SELECT
pdate,
(CASE WHEN (NVL(LAG(pdiff,1) OVER (ORDER BY pdate DESC),0))<>7 AND pdiff=7 THEN pdate ELSE null END) iend,
(CASE WHEN (NVL(LAG(pdiff,1) OVER (ORDER BY pdate DESC),0))=7 AND pdiff<>7 THEN pdate ELSE null END) istart
FROM (
SELECT pdate,
pdate-(LAG(pdate) OVER (ORDER BY pdate)) pdiff
FROM dates WHERE varrank<=10 AND avrank<=10)
)
WHERE istart IS NOT NULL OR iend IS NOT NULL
) WHERE iend IS NOT NULL
ORDER BY iend DESC
This gives us all the intervals (along with their length is days) where, according to our criteria, there has been a significant financial crisis:
08-10-24 to 08-11-28 35 days
74-09-20 to 74-11-01 42 days
73-06-08 to 73-06-29 21 days
70-06-05 to 70-07-17 42 days
66-08-19 to 66-10-28 70 days
62-06-01 to 62-08-03 63 days
46-10-11 to 46-10-25 14 days
37-10-15 to 37-12-10 56 days
31-12-18 to 32-01-08 21 days
31-11-06 to 31-11-20 14 days
You might want to play around with the parameters, e.g. changing varrank<=10 to varrank<=15, etc. You might also try to search if in these periods there really was a financial crisis. Another thing you can try is to redefine WEEKLY_DOW to use MIN(plow) and MAX(phigh) instead of MIN(pclose) and MAX(pclose) and then to recalculate the DATES table. This will give you slightly different crisis intervals. All the above intervals will still be there but a few new ones will emerge (like: 1987-12-04 to 1988-01-22).
That's it for today. I hope you enjoyed this little exercise as much as I did.