09 Oct

$P[|v-u| > \epsilon] \le 2e^{-2\epsilon^2N}$

where $v$ is the in-sample probability

$u$ is the out-of-sample probability

$\epsilon$ is the tolerance

$N$ is the size of the sample
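
To get a feel for the bound, here is a minimal Python sketch that evaluates the right-hand side for a given $\epsilon$ and $N$, and inverts it to find the smallest $N$ that pushes the bound below a chosen level. The function names and the `delta` parameter are my own for illustration; they do not come from any library.

```python
import math


def hoeffding_bound(epsilon: float, n: int) -> float:
    """Right-hand side of the inequality: 2 * exp(-2 * epsilon^2 * N).

    Capped at 1.0 because a probability bound above 1 carries no information.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * epsilon ** 2 * n))


def sample_size_for(epsilon: float, delta: float) -> int:
    """Smallest N such that 2 * exp(-2 * epsilon^2 * N) <= delta.

    `delta` (the bound we are willing to accept) is an illustrative
    parameter, obtained by solving the inequality for N.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # The bound tightens quickly as the sample grows.
    for n in (100, 1000, 10000):
        print(n, hoeffding_bound(0.1, n))
    # Sample size needed for tolerance 0.05 with the bound at most 0.05.
    print(sample_size_for(0.05, 0.05))
```

Note that the bound depends only on $\epsilon$ and $N$, not on $u$ itself, which is why it can be evaluated without knowing the out-of-sample probability.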

26 Dec

I am taking a MOOC on data visualization offered by Wesleyan University. As part of the course I need to pick a research question to work on throughout it. The dataset I’ve chosen is the National Longitudinal Study of Adolescent Health (AddHealth) dataset. My initial work will be exploratory, and I will be working towards interesting hypotheses. In week 1 I’ll be focusing on:

1. Exploring alcohol use in relation to relationship history: how does alcohol use affect the relationships of adolescents?
2. Later on I plan to work with other dimensions in the dataset for an explanatory presentation.

Literature Survey

The studies found a strong protective effect of marriage on substance use and abuse. Their research indicated that single young adults had higher rates of substance abuse. They also studied the effect of relationship quality and observed that higher-quality relationships were negatively associated with smoking.

19 Dec

Over the last 2 weeks I’ve been working with cancer researchers from UC Davis and NIH to help them use the DataExplorer we’ve been building at Emory. My mentor Ashish suggested using SPLOMs, and in this post I’ll discuss why SPLOMs are crucial for exploratory analysis. Most of the ideas in this post emerged from meetings and discussions with Ashish.

In exploratory analysis the goal is to use visual methods to find hidden patterns in the data and to help analysts come up with hypotheses. The DataExplorer’s goals are just that. I won’t say much about the specifics of the work we are doing; instead I will focus on using SPLOMs effectively and on the reasons we’re incorporating them into the DataExplorer.

Bostock’s D3 SPLOM example

A SPLOM is a Scatter Plot Matrix. It lets you do pairwise comparisons of different data attributes. It’s a really effective way of identifying relationships in a dataset. The following are some insights we gathered while using SPLOMs with TCGA datasets.

1. Quick summary of correlations between multiple attributes. Choosing which attributes to represent on the SPLOM is crucial. For our purposes we relied on domain expertise provided by the experts from UC Davis. When domain expertise is not available, statistical techniques such as PCA can be used to identify the attributes to show on the SPLOM.
2. Allowing drill-down and working with limited datasets. The DataExplorer lets you filter data points on the SPLOM to explore relations within subsets of the data. This step is crucial because it lets you drill down and explore related subsets of the data.
3. What’s hidden can also be useful! Visualization systems often hide filtered data points. We chose to “gray out” filtered data points instead, so that we can still see what gets hidden, as that is often useful in formulating hypotheses and in analysis.
4. Handling messy/missing data. Biomedical data is messy! Any real-world dataset is messy. Handling it properly is crucial to a useful visual experience. For our purposes we snapped missing values onto the negative axes.
5. Continuous variables. The SPLOM works best when the attributes are continuous. For discrete variables you could use bubble charts on the SPLOM.
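
Points 3 and 4 can be sketched in a few lines. This is a hypothetical Python illustration, not the DataExplorer’s actual code (which is built on a dc.js fork). It assumes attribute values are non-negative, so a negative sentinel lands outside the real data range, and that “graying out” simply means assigning a muted color. The names `MISSING_SENTINEL`, `Point`, and `prepare_cell`, and the sample attribute names, are all made up for the example.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

# Assumed position for missing values: any negative number works as long as
# the real attribute values are non-negative.
MISSING_SENTINEL = -1.0

Row = Dict[str, Optional[float]]


class Point(NamedTuple):
    x: float
    y: float
    color: str


def prepare_cell(rows: List[Row], x_attr: str, y_attr: str,
                 keep: Callable[[Row], bool]) -> List[Point]:
    """Build the points for one SPLOM cell (one pair of attributes).

    Missing values are snapped to MISSING_SENTINEL rather than dropped, and
    rows rejected by the `keep` filter are grayed out rather than hidden.
    """
    points = []
    for row in rows:
        x, y = row.get(x_attr), row.get(y_attr)
        points.append(Point(
            MISSING_SENTINEL if x is None else x,
            MISSING_SENTINEL if y is None else y,
            "steelblue" if keep(row) else "lightgray",
        ))
    return points


if __name__ == "__main__":
    # Made-up rows, purely for illustration.
    rows = [
        {"age": 50.0, "size": 2.0},
        {"age": None, "size": 1.0},
        {"age": 70.0, "size": None},
    ]
    over_60 = lambda r: r["age"] is not None and r["age"] > 60
    for point in prepare_cell(rows, "age", "size", over_60):
        print(point)
```

Every row produces a point, so the analyst can see both how much data is missing and what a filter excludes.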

We use a fork of dc.js. I’ll be posting a guide on using dc.js to create SPLOMs.

04 Jan

This new year I am gonna be working on deep learning. I have decided to experiment with deep learning techniques in biomedical information retrieval. I have read about neural networks, worked with them on some projects, and even implemented my own neural network library in JavaScript. But this time I am taking baby steps and reviewing all the basic concepts along the way.

I am using Neural Networks and Deep Learning, a free online collection of essays that is being turned into a book, as my reference point. I am also using Deeplearning.net’s reading list. Deep Learning by Bengio et al. is a work in progress; the draft is freely available online, and I am using that too.

### Using Perceptrons for Implementing Logic gates

Logic gates are simple to understand. They are often used as examples to introduce students to Neural networks. Now if we are to implement an OR gate, we basically have to implement the following truth table:

| x | y | Z |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

A perceptron is defined by two parameters: $w$ and $b$, the weight vector and the bias respectively. Now consider the perceptron shown in the following figure. It implements the OR gate. Try the perceptron with different values of x and y and observe the output. You’ll find that it mimics the OR gate.

Perceptron for OR gate
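
Here is a small Python sketch for checking an OR perceptron against the truth table. The weights and bias are assumptions on my part ($w = (1, 1)$, $b = -0.5$, with a step activation that fires when $w \cdot x + b > 0$); they are one choice that works, not necessarily the values in the figure.

```python
from typing import Sequence


def perceptron(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Step-activation perceptron: outputs 1 if w . x + b > 0, else 0."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if weighted_sum > 0 else 0


# Assumed parameters for the OR gate (illustrative, not read off the figure).
OR_WEIGHTS = (1.0, 1.0)
OR_BIAS = -0.5


def or_gate(x: int, y: int) -> int:
    return perceptron((x, y), OR_WEIGHTS, OR_BIAS)


if __name__ == "__main__":
    # Walk through every row of the truth table.
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, or_gate(x, y))
```

With these values the weighted sum is negative only for the input (0, 0), so that is the only row that outputs 0. Changing the bias to $-1.5$ would turn the same perceptron into an AND gate.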

02 Dec

This is a continuation of the ‘Analyzing sentiments on twitter’ post series. You should read part I of this series first; it covers streaming Twitter data to a browser. The goal of these posts is to create a live browser app that listens for tweets and visualizes their sentiment in the browser in real time.

### Sentiyapa.js: a quick-fix sentiment analyzer

A while back I wrote sentiyapa.js, which uses the AFINN list to compute the sentiment of a given text. The basic idea is to split the text into a bag of words, add up the AFINN score of every word that has one, and then normalize the total.
https://github.com/lastlegion/sentiyapa.js/blob/master/sentiyapa.js
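
The idea can be sketched in Python as follows. This is not the code from sentiyapa.js (see the link above for that). The tiny `TOY_LEXICON` is a stand-in for the real AFINN list, the tokenizer is a simple regex, and normalizing by the number of words is my assumption, since normalization can be done in other ways.

```python
import re
from typing import Dict

# Stand-in for the AFINN list, which maps words to integer scores between
# -5 and +5. These few entries are illustrative only.
TOY_LEXICON: Dict[str, int] = {
    "good": 3, "great": 3, "love": 3,
    "bad": -3, "hate": -3, "terrible": -3,
}


def sentiment_score(text: str, lexicon: Dict[str, int] = TOY_LEXICON) -> float:
    """Lexicon-based sentiment: sum the word scores, then normalize.

    Normalization here divides by the total number of words (an assumption),
    so longer texts do not get larger scores just for being longer.
    """
    words = re.findall(r"[a-z']+", text.lower())  # crude bag of words
    if not words:
        return 0.0
    total = sum(lexicon.get(word, 0) for word in words)
    return total / len(words)


if __name__ == "__main__":
    print(sentiment_score("I love this great phone"))
    print(sentiment_score("What a terrible, terrible day"))
```

Words that are not in the lexicon contribute a score of 0 but still count towards the length, which pulls the normalized score of a mostly neutral text towards zero.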

This is a quick-fix technique; we shall use more sophisticated techniques in subsequent experiments. But even this preliminary study gave some interesting results.