User:Rahul Palamuttam/Diary


Data

Location of weather stations can be found here: http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/

The specific file is: http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

Example notebook: http://nbviewer.ipython.org/github/yoavfreund/UCSD_BigData/blob/master/notebooks/weather/Weather%20Analysis.ipynb

Diary

Hi Rahul, this is your Diary. Please fill it in reverse chronological order (newest updates on top)

My results can be found in the following GitHub repository: https://github.com/rahulpalamuttam/idc

The IPython notebook with the scatterplot visuals can be found here: https://github.com/rahulpalamuttam/idc/blob/master/Notebooks/Bokeh-spark%20plot.ipynb

March 7 2016

I have been able to set up a dashboard of three plots: a line graph showing the distribution of users over time, a pie chart showing the distribution of users for a certain group, and a zoomable scatter plot. On top of the dashboard is a slider that allows users to scroll through different times. Julaiti's dataset covers January 5th to 26th, with a gap between the 12th and 25th, so that is what the slider "slides" over.

Included below is a diagram detailing the architecture of the SparkViz application, with Bokeh.

Finished quarter2plot.png


SparkViz Dataflow - ERD.jpeg

Feb 17 - Feb 29 2016

I met again with Julaiti and discussed possible next steps. We discussed the need to have clean data available, since it is entirely possible that people are lying about their location on Twitter. We also brainstormed some mockup ideas for the website.

I am now able to show a distribution pie chart that updates according to what is plotted in the current window. Using Bokeh's ColumnDataSource to update data sources on the front end, I was able to compute weighted averages of each of the colors. These percentages were then used to work out the wedge size and angular range of each section of the pie chart.
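A minimal sketch of the wedge computation described above, assuming each visible point carries a group id and that a group's share is simply its fraction of the visible points (the weighted averages used in the actual dashboard may be computed differently). The function name wedge_angles is hypothetical; the start and end angles it returns are the kind of values that would be pushed into the data source driving the pie chart.

```python
import math
from collections import Counter

def wedge_angles(group_ids):
    """Return {group: (start_angle, end_angle, fraction)} with angles in radians."""
    counts = Counter(group_ids)
    total = sum(counts.values())
    wedges = {}
    start = 0.0
    for group in sorted(counts):
        fraction = counts[group] / total
        end = start + fraction * 2 * math.pi
        wedges[group] = (start, end, fraction)
        start = end  # the next wedge begins where this one ends
    return wedges

# Group ids of the points in the current window (illustrative values only).
visible = [0, 1, 1, 2, 2, 2, 2, 1]
for group, (start, end, fraction) in wedge_angles(visible).items():
    print(group, round(start, 3), round(end, 3), fraction)
```

Recomputing these angles whenever the visible window changes keeps the pie chart in step with the scatter plot.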

I also want to create a graph of variation, or how evenly distributed the points are. The zoom-in and zoom-out functionality at the plot limits still needs to be fixed.

VoterMapDistribution.png

--Yoavfreund 11:24, 7 March 2016 (PST) A step in the right direction. Here are some suggestions:

  • In addition to the global pie chart, associate a pie chart with each region. At the global level, region = state or large city.
  • Associate with each pie-chart color a label (bernie, clinton, trump etc.)
  • When hovering over a pie section show the trending hashtags for that location / group.

Feb 16 2016

The Bokeh server can be used to store scripts. This allows us to generate script tags that can be embedded in HTML pages. I discussed earlier with Julaiti the need to embed dynamic plots from the Bokeh server in HTML pages. This also means that the pages can be embedded in notebook environments.

However, a limitation is that only one view of the page/plot can be open and modified at a time since the server-side variables are global.

Here is an example of the script tag:

<script
   src="http://localhost:5006/autoload.js?bokeh-autoload-element=d0ec28e2-1b9f-4411-8300-187a6ac9c8dd&bokeh-session-id=qZhSaQfRLbKovaHi2h2wkNieUAx33KwoRum1GIQ5kAtR"
   id="d0ec28e2-1b9f-4411-8300-187a6ac9c8dd"
   data-bokeh-model-id="efe6c713-3cf4-4c41-bbaa-234d3f19c66d"
   data-bokeh-doc-id=""
></script>

To let the server's websocket accept connections from the embedding web page, we need to launch the server with the following command-line parameter:


bokeh serve --allow-websocket-origin=localhost:63342

Feb 14 2016

Taking Julaiti's dataset, I am able to assign colors to the points according to the seven groups in the dataset. The colors are assigned by group id like so:

{0:'black', 1:'red', 2:'blue', 3:'yellow', 4:'green', 5:'pink', 6:'brown'}

It is interesting to note that points in regions of higher density appear as a single point with a deeper color. The dataset also includes points that are not in the United States; however, the display has been panned and zoomed to omit those.

Grouplot.jpg

Feb 12 2016

The plot examples of the states were not working because certain states in the geographical glyph array were not rendering properly, which prevented the entire plot from rendering. By taking out these states, I am able to plot points on at least a partial map of the United States.

Next step will be to plot entire countries.

PartialUSAMap.jpg

Feb 8 2016

I am able to plot points transparently. The effect is that regions of higher density show a more solid color than regions of lower density. Bokeh enables this through the fill_alpha parameter. The image below consists of circles of size 15 with fill_alpha=0.3.

TransparentMAPBokehjpg.jpg
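The density effect follows from standard "over" alpha compositing: if n fills with the same fill_alpha are drawn on top of each other, the combined opacity is 1 - (1 - fill_alpha)^n. A small sketch of that arithmetic (the function name is hypothetical, and circle outlines are ignored):

```python
def stacked_opacity(fill_alpha, n_overlapping):
    """Combined opacity of n identical semi-transparent fills drawn on top of each other."""
    return 1.0 - (1.0 - fill_alpha) ** n_overlapping

for n in (1, 2, 5, 10):
    print(n, round(stacked_opacity(0.3, n), 3))
```

With fill_alpha=0.3, two overlapping circles already reach an opacity of 0.51 and ten reach about 0.97, which is why dense regions look nearly solid.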

Feb 1 2016

Bokeh allows you to plot points on Google Maps using its GMapPlot class. Shown below is a sample of 1000 points plotted on the globe. However, zooming in and out does not update the ranges. For this reason, I cannot issue a server callback, since it is only triggered when the bounds change. I am not sure why this is occurring; it could be because the entire plot sits on top of an already computed map glyph. The likely solution is to construct our own map using glyphs and plot on top of that.

USA BokehMaps.jpg

Jan 29 2016

I am able to generate plots of GHCND stations using the Bokeh server and update them using the server callback.

The callback function takes the newly updated x-axis and y-axis ranges and submits a query for a sample of size N within those ranges.
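A minimal sketch of the query the callback submits, assuming the points sit in a plain Python list of (x, y) pairs. In the actual application the filter would run against the Spark backend and the function would be wired to the Bokeh range-change callback; the name sample_in_range, the random stand-in data and the bounding box are all illustrative assumptions.

```python
import random

def sample_in_range(points, x_range, y_range, n, seed=None):
    """Return at most n points whose (x, y) fall inside the visible window."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    visible = [(x, y) for x, y in points
               if x_min <= x <= x_max and y_min <= y <= y_max]
    if len(visible) <= n:
        return visible
    # Uniform sample without replacement from the visible points.
    return random.Random(seed).sample(visible, n)

# Stand-in data: random (longitude, latitude) pairs, not real station locations.
rng = random.Random(0)
stations = [(rng.uniform(-180, 180), rng.uniform(-90, 90)) for _ in range(100_000)]

# Roughly the contiguous United States; the bounds are illustrative.
usa = sample_in_range(stations, (-125, -66), (24, 50), n=1000, seed=1)
print(len(usa))
```

Because the sample is redrawn for each new window, zooming in reveals points that were not part of the coarser sample.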

Here is an image of a sample of 1000 points.

World thousand.jpg

After scrolling, you can see a sample of 1000 points from the United States region.

Thousand usa.jpg

If we take 100,000 points we are able to see entire continents.

One hundred thousand global.jpg

Update 2016

On seamless rendering of scatterplots with a Spark backend.

The core issue is communicating efficiently between the "frontend" widget and the "backend" RDD. Over the past quarter I explored custom rendering libraries, namely Bokeh, and also IPython's callback functionality.


1. Using the D3 API and callback functionality.

The first requirement is handling zoom actions on user input. By keeping a cache of points in browser memory, I can answer zoom queries locally and avoid querying the backend RDD.

https://github.com/rahulpalamuttam/idc/blob/master/PrettyScatter/scatterchart.js#L177-L228

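// x and y (the d3 scales), svg (the plot container) and transform (the per-point
// positioning function) are assumed to be defined elsewhere in scatterchart.js.
var zoomBeh = d3.behavior.zoom()
   .x(x)
   .y(y)
   .scaleExtent([0, 500])
   .on("zoom", zoom);

// Reposition the already-loaded dots on every zoom event.
function zoom() {
   svg.selectAll(".dot")
      .attr("transform", transform);
}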


However, when I do need to fetch more points to display, I need to launch a query on a Spark RDD and obtain the resulting points. IPython notebooks expose a callback functionality that lets JavaScript code in widgets call IPython kernel code. This means we can call code imported from libraries, namely the RDDDB code.

The following post calls it Python/javascript bidirectional communication. https://jakevdp.github.io/blog/2013/06/01/ipython-notebook-javascript-python-communication/

However, the problem is that the callback writes data to disk, which is then picked up by the JavaScript handler. This is abstracted away from the user. It would mean two writes to disk when querying a Spark RDD in an IPython notebook: the first when communicating results between the JVM and the Python environment, and the second during the callback.

Another drawback of this approach is that the callback seems to be very buggy in IPython. Even simple examples are difficult to launch because there are a variety of inputs to process. Furthermore, this technique limits me to IPython notebooks, because Databricks notebooks do not expose callback functionality.

2. Using the D3 API and REST functionality

The second method is to create and use a Spark server. The JavaScript code can launch jobs via a REST API. The Databricks notebook has a REST API for monitoring jobs and returns results in JSON format. However, the notebook does not provide a REST API for launching and terminating jobs.

https://forums.databricks.com/questions/1852/how-do-i-create-a-job-using-the-rest-api.html

Another way to accomplish this is to use the spark-jobserver from Ooyala. However, the build often fails and the current spark-jobserver project does not support the latest Spark versions.

3. Using custom rendering libraries such as Bokeh and plotly.

Bokeh enables notebook code to be written entirely in Python. The library ships its own rendering capabilities. However, it does not enable seamless integration via callbacks. Instead, you need to re-render the plots.

Monday November 23 2015

I am able to sample on the JavaScript side without compromising scale: the points do not grow larger or smaller. Furthermore, I display only a set number of points given the range and domain values of the current view of the scatter plot.

The next step is to look at calling functions from the Spark job server via its REST API.

Friday November 20 2015

Scala.js and Spark for mobile??


https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html

https://issues.apache.org/jira/browse/SPARK-6646

http://blog.knoldus.com/2014/09/03/meet-up-on-scalas-evolving-ecosystem-introduction-to-scala-js/

https://www.youtube.com/watch?v=PQuDD_EHM9I

Thursday November 12 2015

How to use custom designed JavaScript, images, and styles in Databricks Notebook :

1. First place assets into the '/FileStore' DBFS mount

2. 'displayHTML()' can reference the assets using '/files'


Possibility of feeding Scala sequences or arrays to JS code?

- It is not possible to pass an entire array. One can directly access the fields, or store the array as a JSON object inside the file store and load it inside an iframe.
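A rough sketch of the JSON route, written in Python for illustration. It only builds the strings involved: the file names under /FileStore/idc/ and the helper filestore_url are hypothetical, and the final step (writing the payload with something like dbutils.fs.put and rendering with displayHTML) is left as a comment because it only runs inside a Databricks notebook.

```python
import json

def filestore_url(dbfs_path):
    """Map a DBFS path under /FileStore to the /files URL that displayHTML() can load."""
    prefix = "/FileStore/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("asset must live under /FileStore")
    return "/files/" + dbfs_path[len(prefix):]

points = [[1.0, 2.5], [3.2, 0.7]]          # stand-in for an array collected from Spark
payload = json.dumps({"points": points})   # contents to store in the file store
data_path = "/FileStore/idc/points.json"   # hypothetical location

html = (
    '<script src="{d3}"></script>\n'
    '<script>d3.json("{data}", function(err, data) {{ console.log(data.points); }});</script>'
).format(d3=filestore_url("/FileStore/idc/d3.min.js"), data=filestore_url(data_path))

# In a Databricks notebook one would then store the payload and render the page,
# e.g. dbutils.fs.put(data_path, payload, True) followed by displayHTML(html).
print(html)
```

Storing the array as a file keeps the HTML small, since the page fetches the data by URL instead of having it pasted into the markup.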

D3 Filestore Demo

Sunday November 8 2015

To use tools from the virtualenv, run $ source env.sh. This points all Python aliases to the appropriate executables under .env/bin

Notes :

The current implementation relies on two classes: the Renderer and RDDDB.

Renderer: code that calls the RDDDB and generates a scatter plot.

   - consists of basic functionality to render a scatter plot using Bokeh.
     Bokeh is a Python library for generating interactive visualizations.
     The Bokeh site can be found here: http://bokeh.pydata.org/en/latest/
     
     The Renderer class consists of the renderer function, which takes a range query
     and outputs a Bokeh scatter plot. The scatter plot consists of circular elements.
     The size of these elements needs to be adjusted based on the number of elements on
     the scatter plot and the dimensions of the query. For example, a query of (1,1,1,1)
     is a 1 by 1 box, so a plot with elements of size 1 would take up the entire space.
     
     The Renderer runs the range queries on top of the RDDDB.
     

RDDDB

   - consists of the Spark RDD.
      For this particular example the RDD consists of 10 million randomly
      generated pairs of numbers between 0 and 100.
      
      The RDDDB class also persists a local cache (on the head node) of point pairs based on
      the largest query run to date. The problem, however, is that when running
      a large enough query, the collect call on the RDD could extract
      all the data points. Instead, what we need to do is take a certain number
      of elements from the RDD as memory permits.
      
      RDDDB has a function called crossfilter. The crossfilter function takes the
      minimum and maximum values and first checks whether the pointsCache has elements
      matching the ranges. If the ranges are larger than what is currently stored
      in the pointsCache, an RDD action is launched to re-extract the necessary points.
      Otherwise the pointsCache is queried for faster results.
      
      Note that the current implementation gives faster results only when going from
      a larger query range to a smaller query range. I will need to look into ways
      to either hash the results of range queries or do a faster search. With a large
      local cache, the local search may be slower than searching the RDD cache, since
      the RDD search is run in parallel.
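
The caching behaviour described above can be sketched as follows. This is a toy stand-in and not the actual RDDDB code: a plain Python list plays the role of the Spark RDD, and the class name RangeCache, the max_points cap and the backend_queries counter are hypothetical names introduced for illustration.

```python
import random

class RangeCache:
    """Toy stand-in for RDDDB; `data` plays the role of the Spark RDD."""

    def __init__(self, data, max_points=10_000):
        self.data = data
        self.max_points = max_points   # cap on how many points are pulled locally
        self.cached_box = None         # (x_min, x_max, y_min, y_max) of the cached query
        self.points_cache = []
        self.backend_queries = 0       # counts simulated RDD actions

    @staticmethod
    def _inside(point, box):
        x, y = point
        x_min, x_max, y_min, y_max = box
        return x_min <= x <= x_max and y_min <= y <= y_max

    def _covered(self, box):
        """True if the requested box lies entirely within the cached box."""
        if self.cached_box is None:
            return False
        cx_min, cx_max, cy_min, cy_max = self.cached_box
        return (cx_min <= box[0] and box[1] <= cx_max and
                cy_min <= box[2] and box[3] <= cy_max)

    def crossfilter(self, x_min, x_max, y_min, y_max):
        box = (x_min, x_max, y_min, y_max)
        if not self._covered(box):
            # Cache miss: go back to the backend. With PySpark this would be an
            # action on the RDD; taking a bounded number of elements (rather
            # than collecting everything) keeps the result within local memory.
            self.backend_queries += 1
            hits = [p for p in self.data if self._inside(p, box)]
            self.points_cache = hits[:self.max_points]
            self.cached_box = box
            return list(self.points_cache)
        # Cache hit: the smaller range is answered from the local cache.
        return [p for p in self.points_cache if self._inside(p, box)]

rng = random.Random(0)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(50_000)]
db = RangeCache(points, max_points=50_000)
wide = db.crossfilter(0, 50, 0, 50)      # miss: scans the backend
narrow = db.crossfilter(10, 20, 10, 20)  # hit: answered from the local cache
print(len(wide), len(narrow), db.backend_queries)
```

As in the description, only zooming in benefits from the cache; any query that extends beyond the cached box triggers a new backend scan. When the cap truncates a result, later zoom-ins see only the retained subset.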
      

IPython: consists of the renderer

   - Unfortunately, the section of code in the IPython notebook will need to be run again
     with the call to render. While we still take advantage of the pre-loaded RDD and
     local point cache, I am not able to find a way to do a callback to Python code
     to re-populate the data results. I would appreciate help with this.
     Otherwise, we achieve filtering of large datasets by running a query on the RDD
     or the local cache and throwing the results into a d3 rendered scatter plot.
     Bokeh constructs this behind the scenes.
     

TODO ::

   1. Better caching/query algorithm to support both zooming in and zooming out. Currently
      we only get the benefit of the local cache when zooming in.
   2. Interactive utility to call back to the render() function through a JavaScript button.

Saturday November 7 2015

Working IPython notebook with Bokeh scatter plot. Results can be found in the following GitHub repository: https://github.com/rahulpalamuttam/idc

The IPython notebook with the scatterplot visuals can be found here: https://github.com/rahulpalamuttam/idc/blob/master/Notebooks/Bokeh-spark%20plot.ipynb

Wednesday October 28 2015

What sort of range-query caching algorithms could I be using? I have got a few d3 plots working as HTML widgets in the IPython notebook. However, the callback function is difficult to implement. David Lisuk's idcdashboard writes the entire data to the HTML file and then renders it. However, the actual callback is very strange. Is the JavaScript library calling Python code to get the arrays?

Wednesday October 14 2015

I am struggling to figure out how to re-inject data from a resampling. BlinkDB is built on top of Spark's RDDs and provides fast filtering capabilities. It is still in the alpha stage, but here is the link: http://blinkdb.org/ It supports random sampling within a range of data stored in an RDD.

Friday October 9 2015

After several chats with David, I decided to move away from the idc dashboard and play around with more contemporary examples. In particular I found plotly and Bokeh. Both use simple APIs that generate graphs. They do not use d3.js but rather define their own JavaScript modules for graphical output. I am still not sure how to update a data source that has not been preloaded into a JavaScript environment. Bokeh enables updates to be made to the Bokeh server, but there doesn't seem to be a way to publish "new" data to the Bokeh server. You can define callbacks, but they only affect how the region is filtered or zoomed rather than inserting new data points.

Monday October 5 2015

I worked with David to fix the issues with the idc dashboard. The Python code compiles successfully; however, the Tornado server fails to find d3.js in the static/d3.js directory. The idc dashboard hard-codes JavaScript and HTML code into a string in Python. Could this be the reason for the errors? In particular, Dashboard.py in the Dashboard package includes an $$INSERT$$ string at the beginning of the file. The __init__.py tries to paste HTML/JavaScript code in place of the $$INSERT$$.
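
For reference, the placeholder mechanism amounts to a string substitution along these lines. This is a hypothetical sketch, not the dashboard's actual code: the function name, the template text and the inserted snippet are made up for illustration.

```python
PLACEHOLDER = "$$INSERT$$"

def fill_template(template, snippet):
    """Replace the single placeholder in `template` with `snippet`."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("expected exactly one placeholder")
    return template.replace(PLACEHOLDER, snippet)

page = "$$INSERT$$\n<div id='dashboard'></div>"
print(fill_template(page, "<script src='static/d3.js'></script>"))
```

A plain str.replace is used here because string.Template treats "$$" as an escape sequence, so it would not see this marker as a placeholder.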

Tuesday September 29 2015

Just figured out how to update the user page. Was making updates to discussion before.


Monday September 28 2015

David Lisuk responded with answers to my questions:

1) For the Weather.ipynb, Sampling_DF_Backend is unable to generate any dashboard because the dashboard variable is initially set to None. It fails when _resample_data(self) is called, which checks for the existence of the dashboard variable.

Ans >> It seems like the latest commit has a few bugs. The commit 812be7f3140d934a334317faf312a7bd067a47f5 seems like it is more stable. Using this version I was able to get basic demo, nasdaq demo, and weather demo working. Do these work on the old version for you?

2) IPython versioning issues: I've been running it with IPython 3.0 and it looks like ContainerWidgets is deprecated. I modified the code a little bit by getting rid of ContainerWidgets.

Ans >> I was running ipython version 2.2.0. It's very possible that the code would not work with 3.0 as I think a lot has changed in version 3.0. In fact I believe the new APIs in 3.0 are easier to use/a lot of my code could be simplified if written for 3.0.

3) For the BasicDemo example, I am not able to get the scatter plot. This could be because ContainerWidgets is deprecated.

A fairly detailed description of the parts of the project, which also explains the design decisions behind it: https://github.com/dlisuk/masters_report.  For help with specific parts, possibly asking specific questions would help, also you can add me on facebook and we can talk live which may be helpful.  For handling spark RDDs, I wrote a restful API which would take filters from the app and compute samples to be downloaded into the notebook.  At the time I believe this was a reasonable choice as pyspark was pretty weak; however, pyspark is more mature now (I've used it at palantir these last few months and saw it mature) so you should easily be able to implement the needed sampling in pyspark now without much difficulty.


Thursday September 24 2015

1) For the Weather.ipynb, Sampling_DF_Backend is unable to generate any dashboard because the dashboard variable is initially set to None. It fails when _resample_data(self) is called, which checks for the existence of the dashboard variable.

2) IPython versioning issues: I've been running it with IPython 3.0 and it looks like ContainerWidgets is deprecated. I modified the code a little bit by getting rid of ContainerWidgets.

3) For the BasicDemo example, I am not able to get the scatter plot. This could be because ContainerWidgets is deprecated.

