Wednesday, October 26, 2011

PostgreSQL cheat sheet

A handy cheat-sheet for the PostgreSQL database, for when I'm too lazy to dig through the docs or find another cheat-sheet.

Start and stop server

sudo su postgres -c '/opt/local/lib/postgresql/bin/postgres -D /opt/local/var/db/postgres/defaultdb'
cnt-z
bg

or

su -c 'pg_ctl start -D /opt/local/var/db/postgres/defaultdb -l postgreslog' postgres

To shutdown

sudo su postgres -c 'pg_ctl stop -D /opt/local/var/db/postgres/defaultdb'

Run client

psql -U postgres
sudo -u postgres psql

Commands

Postgres commands start with a backslash '\' character. Type \l to list databases and \c to connect to a database. \? shows help and \q quits. \d lists tables, views and sequences, \dt lists tables.

Granting access privileges

create database dbname;
create user joe_mamma with password 'password';
grant all privileges on database dbname to joe_mamma;
grant all privileges on all tables in schema public to joe_mamma;
grant all privileges on all sequences in schema public to joe_mamma;

See the docs for GRANT.

SQL dump and restore

pg_dump -U postgres dbname | gzip > dbname.dump.2011.10.24.gz
gunzip < dbname.dump.2011.10.24.gz | sudo -u postgres psql --dbname dbname

For more, see Backup and Restore from the Postgres manual.

Truncate

Delete all data from a table and related tables.

truncate my_table CASCADE;

Sequences

Sequences can be manipulated with currval and setval.

select currval('my_table_id_seq');
select setval('my_table_id_seq',1,false);

Trouble-shooting

If you seen an Ident authentication error...

FATAL:  Ident authentication failed for user "postgres"

... look in your pg_hba.conf file. Ask Postgres where this file is by typing, "show hba_file;".

sudo cat /etc/postgresql/9.0/main/pg_hba.conf

You might see a line that looks like this:

local  all  all      ident

What the 'ident' means is postgres uses your shell account name to log you in. Specifying the user on the command line "psql -U postgres" doesn't help. Either change "ident" in the pg_hba.conf to "md5" or "trust" and restart postgres, or just do what it wants: "sudo -u postgres psql". More on this can be found in “FATAL: Ident authentication failed”, or how cool ideas get bad usage schemas.

Machine-learning: gradient descent

The first section of Andrew Ng's Machine Learning class is about applying gradient descent to linear regression problems.

Andrew Ng

Our input data is an m-by-n matrix X, where we have m training examples with n features each. For these training examples, we know the expected outputs y where y is the variable we're trying to predict. We want to find a line defined by the parameter vector ϴ that minimizes the squared error between the line and our data points.

Gradient descent takes a cost function, which is the squared error of the prediction vs. the training data. I think the 2 in the denominator is there so that it cancels out when we take the derivative, leaving us with a simpler gradient function.

The update rule for each ϴj is the partial derivative of the cost function with respect to ϴj.

Part of the challenge is converting this to matrix notation, to take advantage of fast matrix arithmetic algorithms.

Next we vectorize the update rule and show how to compute least squared error directly with the normal equation.

In action, gradient descent gradually approaches optimal values for ϴ. How gradual depends on the learning rate, α.

While the classwork was done in Octave, I also did a simple gradient descent implementation in R.

Saturday, October 22, 2011

Octave cheat sheet

I'm mucking about with Octave, MATLAB's open source cousin, as part of Stanford's Machine Learning class. Here are a few crib notes to keep me right side up.

The docs for Octave must be served from a Commodore 64 in Siberia judging by the speed, but Matlab's Function Reference is convenient. The Octave WikiBook covers a lot of the basics.

Matrices

Try some matrix operations. Create a 2x3 matrix, and a 3x2 matrix. Multiply them to get a 2x2 matrix. Try indexing.

>> A = [1 2 3; 4 5 6]
A =
   1   2   3
   4   5   6

>> B = 2 * ones(3,2)
B =
   2   2
   2   2
   2   2

>> size(B)
ans =
   3   2

>> A * B  % matrix multiplication
ans =
   12   12
   30   30

>> who    % list variables
A    B    ans

>> A(2,3) % get row 2, column 3
ans =  6

>> A(2,:) % get 2nd row
ans =
   4   5   6

>> A'     % A transpose
ans =
   1   4
   2   5
   3   6

>> A' .* B  % element-wise multiply
ans =
    2    8
    4   10
    6   12

Sum

sum(A,dim) is a little bass-ackwards in that the columns are dimension 1, rows are dimension 2, contrary to R and common sense.

>> sum(A,2)
ans =
    6
   15

Max

The max function operates strangely. There are at least 3 forms of max.

[C,I] = max(A)
C = max(A,B)
[C,I] = max(A,[],dim)

For max(v), if v is a vector, returns the largest element of v. If A is an m x n matrix, max(A) returns a row vector of length n holding the largest element from each column of A. You can also get the indices of the largest values in the I return value.

To get the row maximums, use the third form, with an empty vector as the second parameter. Oddly, setting dim=1 gives you the max of the columns, while dim=2 gives the row maximums.

Navigation and Reading data

Perform file operations with Unix shell type commands: pwd, ls, cd. Import and export data, like this:

>> data = csvread('ex1data1.txt');
>> load binary_file.dat

Printing output

The disp function is Octave's word for 'print'.

disp(sprintf('pi to 5 decimal places: %0.5f', pi))

Histogram

Plot a histogram for some normally distributed random numbers

>> w = -6 + sqrt(10)*(randn(1,10000))  % (mean = 1, var = 2)
>> hist(w,40)

Plotting

Plotting

t = [0:0.01:0.99];
y1 = sin(2*pi*4*t); 
plot(t,y1);
y2 = cos(2*pi*2*t);
hold on;         % "hold off" to turn off
plot(t,y2,'r');
xlabel('time');
ylabel('value');
legend('sin','cos');
title('my plot');
print -dpng 'myPlot.png'
close;           % or,  "close all" to close all figs

Multiple plots in a grid.

figure(2), clf;  % select figure 2 and clear it
subplot(1,2,1);  % Divide plot into 1x2 grid, access 1st element
plot(t,y1);
subplot(1,2,2);  % Divide plot into 1x2 grid, access 2nd element
plot(t,y2);
axis([0.5 1 -1 1]);  % change axis scale

heatmap

figure;
imagesc(magic(15)), colorbar

These crib notes are based on the Octave tutorial from the ml class by Andrew Ng. Also check out the nice and quick Introduction to GNU Octave. I'm also collecting a few notes on matrix arithmetic.

Defining a function

function ret = test(a)
  ret = a + 1;
end

More

Tuesday, October 18, 2011

Python cheat sheet

The most important docs at python.org are the tutorial and library reference.

Pointers to the docs

Important modules: sys, os, os.path, re, math, io

Inspecting objects

>>> help(obj)
>>> dir(obj)

List comprehensions and generators

numbers = [2, 3, 4, 5, 6, 7, 8, 9]
[x * x for x in numbers if x % 2 == 0]

Generators might be thought of as lazy list comprehensions. Really they're just a compact syntax for defining an iterator.

def generate_squares():
  i = 1
  while True:
    yield i*i
    i += 1

Main method

def main():
    # do something here

if __name__ == "__main__":
    main()

See Guido's advice on main methods. To parse command line arguments use argparse instead o the older optparse or getopt.

Classes

The tutorial covers classes, but know that there are old-style classes and new-style classes.

class Foo(object):
  'Doc string for class'

  def __init__(self, a, b):
    'Doc string for constructor'
    self.a = a
    self.b = b
  
  def square_plus_a(self, x):
    'Doc string for a useless method'
    return x * x + a
    
  def __str__(self):
    return "Foo: a=%d, b=%d" % (self.a, self.b)

Preoccupation with classes is a bit passé these days. Javascript objects are just bags of properties to which you can add arbitrary properties whenever you feel like it. In Ruby, you might use OpenStruct. It's quite easy in Python. You just have to define your own class. I'll follow the convention I've seen elsewhere of creating an empty class called Object derived from the base object. Why you can't set attributes on an object instance is something I'll leave to the Python gurus.

class Object(object):
    pass

obj = MyEmptyClass()
obj.foo = 123
obj.bar = "A super secret message
dir(obj)
['__doc__', '__module__', 'bar', 'foo']

You can add methods, too, but they act a little funny. Self doesn't seem to work.

Files

Reading text files line by line can be done like so:

with open(filename, 'r') as f:
    for line in f:
        dosomething(line)

Be careful not to mix iteration over lines in a file with readline().

Exceptions

try:
  raise Exception('My spammy exception!', 1234, 'zot')
except Exception as e:
  print type(e)
  print e
finally:
  print "cleanup in finally clause!"

Traceback prints stack traces.

Conditional Expressions

Finally added in Python 2.5:

x = true_value if condition else false_value

Packages, Libraries, Modules

What do they call them in Python? Here's a quick tip for finding out where installed packages are:

python -c 'import sys, pprint; pprint.pprint(sys.path)'

To find out what packages are installed, open a python shell and type:

help('modules')

Magic methods

Python's magic methods are the source of much confusion. Rafe Kettler's Guide to Python's Magic Methods sorts things out beautifully.

Saturday, October 08, 2011

Stanford Machine Learning class

Stanford is offering a free online version of it's Machine Learning class taught by Andrew Ng. Study groups are popping up everywhere. Cool!

The class officially starts Monday, October 10th, but the first few lectures are up already, broken into bite size pieces of 10 minutes or so. What I've seen so far is at a basic level, covering a course introduction and terminology. Ng then posses a linear regression problem.

We want to find a line y = ϴ0 + ϴ1 x such that we minimize the squared error between our line and our data points.

The solution is our first learning algorithm, gradient descent.

More about the Machine Learning class

The real class at Stanford is: CS229. Exercises are to be done in Octave. Recommended reading includes the usual suspects:

  • Pattern Recognition and Machine Learning, Christopher Bishop
  • Machine Learning, Tom Mitchell
  • The Elements of Statistical Learning, Hastie, Tibshirani and Friedman

Several of the Primers in Computational Biology series would probably make for good supplementary material.

There are threads related to the class on Quora and Reddit, for whatever that's worth. Also, see some good resources for learning about machine learning.