09 Oct

$P[|v-u| > \epsilon] \le 2e^{-2\epsilon^2N}$

where $v$ is the in-sample probability

$u$ is the out-of-sample probability

$\epsilon$ is the tolerance

$N$ is the size of the sample
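To get a feel for the numbers, here is a small sketch that evaluates the right-hand side of the bound and inverts it to get a sample size. The function names and the `delta` parameter (the largest bound value you are willing to accept) are my own choices for illustration.

```python
import math


def hoeffding_bound(epsilon: float, n: int) -> float:
    """Right-hand side of the inequality: 2 * exp(-2 * epsilon^2 * N)."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n)


def samples_needed(epsilon: float, delta: float) -> int:
    """Smallest N for which the bound is at most delta (delta is an assumed name)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# With tolerance 0.1 and 100 samples the bound is still fairly loose.
print(hoeffding_bound(0.1, 100))
# Number of samples needed to push the bound down to 0.05 at the same tolerance.
print(samples_needed(0.1, 0.05))
```

Because $N$ sits in the exponent, the bound tightens quickly as the sample grows, while halving the tolerance roughly quadruples the samples needed.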

26 Dec

I am taking a MOOC on data visualization offered by Wesleyan University. As part of the course I need to pick a research question to work on throughout. The dataset I’ve chosen is the National Longitudinal Study of Adolescent Health (AddHealth) dataset. My initial work will be exploratory, working towards interesting hypotheses. For week 1 I’ll be focusing on:

1. Exploring alcohol usage vis-a-vis relationship history: how does alcohol usage affect the relationships of adolescents?
2. Later on I plan to work with other dimensions in the dataset for an explanatory presentation.

Literature Survey

The studies found a strong protective effect of marriage on substance use and abuse. Their research indicated that single young adults had higher rates of substance abuse. They also studied the effect of relationship quality and observed that higher-quality relationships had a negative association with smoking.

19 Dec

Over the last 2 weeks I’ve been working with cancer researchers from UC Davis and NIH, helping them use the DataExplorer we’ve been building at Emory. The use of SPLOMs was suggested by my mentor Ashish, and in this post I’ll discuss why SPLOMs are crucial for exploratory analyses. Most of the ideas in this post emerged from meetings and discussions with Ashish.

In exploratory analysis the goal is to use visual methods to find hidden patterns in the data and to help analysts come up with hypotheses. The DataExplorer’s goals are just that. I won’t talk much about the specifics of our work; instead I will focus on using SPLOMs effectively and on the reasons we’re incorporating them in the DataExplorer.

Bostock’s D3 SPLOM example

A SPLOM is a scatter plot matrix. It lets you do pairwise comparisons of different data attributes. It’s a really effective way of identifying relationships in a dataset. Following are some insights we gathered while using SPLOMs with TCGA datasets.

1. Quick summary of correlations between multiple attributes. Choosing which attributes to represent on the SPLOM is crucial. For our purposes we relied on domain expertise provided by the experts from UC Davis. Where domain expertise is not available, statistical techniques such as PCA can be used to identify the attributes to show on the SPLOM.
2. Allowing drill-down and working with a limited dataset. The DataExplorer lets you filter data points on the SPLOM to explore relations within subsets of the data. This step is crucial because it lets you drill down and explore related subsets.
3. What’s hidden can also be useful! Visualization systems often hide filtered data points. We chose to “gray out” filtered data points instead, so you can still see what gets hidden, as that is often useful in formulating hypotheses and in analysis.
4. Handling messy/missing data. Biomedical data is messy! Any real-world dataset is messy. Handling it properly is crucial to a useful visual experience. For our purposes we snapped missing values onto the negative axes.
5. Continuous variables. The SPLOM works best when the attributes are continuous. For discrete variables you could use bubble charts on the SPLOM.
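Points 2 to 4 can be sketched as a small data-preparation step. This is only an illustration, not the DataExplorer code: the record fields, the `keep` filter and the sentinel value of -1 are hypothetical, and it assumes real values are never negative. It builds one panel per attribute pair (the upper triangle of the matrix).

```python
from itertools import combinations

MISSING_SENTINEL = -1.0  # assumption: real values are non-negative


def snap_missing(value):
    """Map a missing value (None) to a sentinel on the negative axis."""
    return MISSING_SENTINEL if value is None else float(value)


def splom_panels(records, attributes, keep=lambda record: True):
    """Build one list of (x, y, active) points per attribute pair.

    Filtered-out records stay in every panel with active=False,
    so a renderer can draw them in gray instead of hiding them.
    """
    panels = {}
    for x_attr, y_attr in combinations(attributes, 2):
        panels[(x_attr, y_attr)] = [
            (snap_missing(r.get(x_attr)), snap_missing(r.get(y_attr)), bool(keep(r)))
            for r in records
        ]
    return panels


# Made-up records, purely for illustration.
records = [
    {"age": 61, "tumor_size": 2.4, "survival_months": 30},
    {"age": 45, "tumor_size": None, "survival_months": 52},
    {"age": 70, "tumor_size": 3.1, "survival_months": None},
]
panels = splom_panels(
    records,
    ["age", "tumor_size", "survival_months"],
    keep=lambda r: r["age"] >= 50,
)
print(panels[("age", "tumor_size")])
```

Keeping the filter as a flag rather than dropping rows is what makes the “gray out” behaviour possible at drawing time.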

We use a fork of dc.js. I’ll be posting a guide about using dc.js for creating SPLOMs.

14 Dec

http://www.datasciencebowl.com/ Looking forward to participating this year. The challenge is to “develop an algorithm to empower doctors to more easily diagnose dangerous heart conditions, and help advance the science of heart disease treatment.”

31 Jan

Text summarization is a difficult challenge for NLP researchers. I am currently experimenting with a few text-summarization algorithms in my projects. One of them is LexRank, a graph-based algorithm that uses a similarity function (cosine similarity in the original paper) to compute similarities between sentences. It uses a pre-defined threshold to build the graph of the document, creating an edge between 2 sentences (nodes) whenever their similarity is above the threshold. It then uses a PageRank-like scheme to rank the sentences (nodes).
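To make the idea concrete, here is a minimal standard-library sketch of the scheme described above. It is not the implementation from the paper or from any library: it uses raw word counts for the cosine similarity, and the threshold, damping factor and iteration count are values I assumed for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences with a PageRank-like iteration over the similarity graph."""
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    # Edge between two different sentences when their similarity exceeds the threshold.
    neighbours = [
        [j for j in range(n) if j != i and cosine_similarity(bags[i], bags[j]) > threshold]
        for i in range(n)
    ]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / len(neighbours[j]) for j in neighbours[i])
            for i in range(n)
        ]
    return scores


def summarize(sentences, count=2, **kwargs):
    """Return the `count` highest-scoring sentences in their original order."""
    scores = lexrank(sentences, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "A cat sat on a rug.",
    "Stock prices fell sharply today.",
]
print(summarize(sentences, count=2))
```

A sentence that shares no words with the rest ends up with no edges and only receives the small baseline score, so it is the last one to make it into a summary.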

In this post we shall use sumy, a Python library that implements LexRank and a few other summarization algorithms. The following example code reads from a plain text file and generates a summary.

09 Jan

Though Linear Regression may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression. Consequently, the importance of having a good understanding of linear regression before studying more complex learning methods cannot be overstated.
- James et al., An Introduction to Statistical Learning
From an answer to one of my questions on Coursera’s forums about the relevance of regression in modern data science.

04 Jan

This new year I am going to be working on deep learning. I have decided to experiment with deep learning techniques in biomedical information retrieval. I have read about neural networks, worked with them on some projects, and even implemented my own neural network library in Javascript. But this time I am taking baby steps and reviewing all the basic concepts along the way.

I am using Neural Networks and Deep Learning, which seems to be a free online collection of essays being converted into a book, as my reference point. I am also using Deeplearning.net’s reading list. Deep Learning by Bengio et al. is a work in progress; the draft is freely available online and I am using that too.

### Using Perceptrons for Implementing Logic gates

Logic gates are simple to understand. They are often used as examples to introduce students to Neural networks. Now if we are to implement an OR gate, we basically have to implement the following truth table:

| x | y | Z |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

A perceptron is defined by 2 parameters: $w$ and $b$, the weight vector and the bias respectively. Now consider the perceptron shown in the following figure. It implements the OR gate. Try the perceptron with different values of x and y and observe the output value. You’ll find that it mimics the OR gate.

Perceptron for OR gate
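If you would rather check this in code than by hand, here is a minimal sketch. The weights $w = (1, 1)$, the bias $b = -0.5$ and the step activation are one choice that works and are my assumption; the values in the figure may differ.

```python
def perceptron(inputs, weights, bias):
    """Output 1 when the weighted sum plus the bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Assumed parameters; any weights above 0.5 with this bias behave the same.
OR_WEIGHTS = (1.0, 1.0)
OR_BIAS = -0.5

# Walk through the truth table.
for x in (0, 1):
    for y in (0, 1):
        print(x, y, perceptron((x, y), OR_WEIGHTS, OR_BIAS))
```

The bias acts as the firing threshold: with $b = -0.5$ a single active input is enough to push the sum above zero, which is exactly the OR behaviour.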

02 Dec

This is a continuation of the ‘Analyzing sentiments on twitter’ post series. You can (and should) read Part I of this series, which talks about streaming Twitter data to a browser. The goal of these posts is to create a live browser app that listens for tweets and visualizes their sentiment in the browser in real time.

### Sentiyapa.js: a quick-fix sentiment analyzer

A while back I wrote sentiyapa.js, which uses the AFINN list to compute the sentiment of a given text. The basic idea is to split the text into a bag of words, add up the AFINN sentiment score of every word that appears in the list, and then normalize the total.
https://github.com/lastlegion/sentiyapa.js/blob/master/sentiyapa.js
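The same idea in a few lines of Python, as a rough sketch rather than a port of sentiyapa.js. The tiny lexicon below is a placeholder in the AFINN style (integer scores per word), and normalizing by the number of words is my assumption about what “normalize” means here.

```python
import re

# Placeholder lexicon in the AFINN style; the real list has thousands of entries
# and these scores are illustrative only.
LEXICON = {"good": 3, "love": 3, "happy": 3, "bad": -3, "hate": -3, "sad": -2}


def sentiment(text, lexicon=LEXICON):
    """Sum the scores of known words and divide by the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    total = sum(lexicon.get(word, 0) for word in words)
    return total / len(words)


print(sentiment("I love this good day"))
print(sentiment("I hate sad endings"))
```

Words missing from the lexicon contribute zero, so long neutral texts are pulled towards a score of 0.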

This is a quick-fix technique; we shall use more sophisticated techniques in subsequent experiments. But even this preliminary study gave some interesting results.

15 Nov

Sentiment analysis is a widely studied area of natural language processing. In this series we try to understand sentiment analysis. We’ll write our own quick-fix sentiment analyzer, and in subsequent posts we’ll explore techniques to visualize social media sentiment.

### Streaming Twitter Data

In this post we shall track a hashtag on Twitter and push the matching tweets live to the browser. We assume that you have Node installed and know how to configure and run an Express web server on it.

npm install node-tweet-stream

This installs node-tweet-stream, which lets you stream Twitter data on your Node server. We push each tweet to the client as soon as we receive it. The architecture is as follows:

On the server side we emit the tweet every time we receive it:

We listen for tweets on the client side:

So that’s pretty much all you need to do to get a stream of tweets in your browser. Once you have this stream you can add your presentation logic to create visualizations or other fancy stuff with the tweets. You could also do computationally intensive work on the tweets on the server and push the result to the client along with each tweet.

12 Nov

Disclaimer: I don’t advise using Javascript for data science. I do write/use learning libraries at times just for the fun of it and, of course, Atwood’s law. This post is about the theoretical background for linear regression, not the Javascript implementation.

A while ago I wrote lineareg.js, a Javascript library that lets you fit a line to a dataset. You can find the source code on GitHub or install it from npm. I realized I never got around to describing it, so here it is:

The crux of the code is in the cost computation.
The hypothesis $h = X\theta$ is our prediction vector.
The difference is $D = h - y$.
The cost function is $J = \frac{1}{2m}\sum_{i=1}^{m}D_i^2$, where $m$ is the number of training examples.

Now we need to minimize this cost function, and for this we use gradient descent. To find a local minimum, gradient descent takes a step in the direction of the steepest negative gradient in every iteration. The number of iterations.
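Here is a minimal Python sketch of that loop for a single feature, using the squared-error cost $J = \frac{1}{2m}\sum (h - y)^2$. It is not the code from lineareg.js; the learning rate `alpha`, the iteration count and the toy data are assumptions picked for illustration.

```python
def cost(theta, xs, ys):
    """Squared-error cost: 1/(2m) times the sum of squared prediction errors."""
    m = len(xs)
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y = theta[0] + theta[1] * x by stepping against the gradient of the cost."""
    theta = [0.0, 0.0]
    m = len(xs)
    for _ in range(iterations):
        # Difference between prediction and target for every example.
        diffs = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
        grad_intercept = sum(diffs) / m
        grad_slope = sum(d * x for d, x in zip(diffs, xs)) / m
        theta = [theta[0] - alpha * grad_intercept, theta[1] - alpha * grad_slope]
    return theta


# Toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta = gradient_descent(xs, ys)
print(theta, cost(theta, xs, ys))
```

If `alpha` is too large the steps overshoot and the cost grows instead of shrinking, so it is worth printing the cost every few iterations when trying this on real data.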

You can find the source code on GitHub.
Or install it from npm: “npm install lineareg”