GIScience

Jupyter Notebook Web Access on Google Cloud

Google has fantastic tutorials on how to set up a VM, but they were a bit sketchy on how to get a Jupyter notebook server up and running that can be accessed with a password through a web browser. This video filled the gap for me by paying careful attention to the firewall settings and other details, like the use of tmux (cheat sheet).

Messy Bash Script and Notes 

#tutorial from : https://www.youtube.com/watch?v=gMDQZPoMECE
#tmux tutorial from: https://www.youtube.com/watch?v=BHhA_ZKjyxo


sudo apt-get update
sudo apt-get --assume-yes upgrade
sudo apt-get --assume-yes install tmux build-essential gcc g++ make binutils
sudo apt-get --assume-yes install software-properties-common

sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip

sudo -H pip install jupyter
#jupyter notebook --generate-config
#enter the config file, then change this information
#sudo nano ~/.jupyter/jupyter_notebook_config.py
#c = get_config()
#c.NotebookApp.ip = "*"
#c.NotebookApp.open_browser = False
#c.NotebookApp.port = 5678

#use tmux to keep things running
#jupyter notebook password >> create login password

# to run: jupyter-notebook --no-browser --port=5678

### external port must match VM firewall settings
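If the notebook page won't load, the first thing I would check is whether the port is reachable at all. Here is a minimal sketch of that check in Python. The function name port_is_open is my own, and 127.0.0.1 is just a placeholder: swap in the VM's external IP to test the firewall rule from your own machine.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 127.0.0.1 is a placeholder (assumption); 5678 is the port used in the notes above
print(port_is_open("127.0.0.1", 5678))
```

A False result from outside the VM combined with a True result from inside it usually points at the firewall rule or at the ip setting in the config file, rather than at Jupyter itself.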

Creating a MySQL Database on Google's Cloud Compute for Lahman's Baseball Data

Google recently released fully managed MySQL and PostgreSQL instances that are very easy to deploy. While not GIS per se, SQL is a fundamental skill for GISers in general, so as a baseball fan, I decided to see how easy it would be to create a MySQL database from Sean Lahman's baseball database. It took me about 15 minutes to figure out and do the work. Here is how it's done.

Step 0, go to this page and follow the directions. Frankly, I can't do it better than Google's quick start. After completing this tutorial, you will have created a small database.

Once that is done, you can download and import Lahman's database.

1) First, exit MySQL:

mysql> exit

2) Use wget to download the most current SQL database from this page:

$ wget http://seanlahman.com/files/database/lahman2016-sql.zip

3) And then unzip the file:

$ unzip lahman2016-sql.zip

4) Reconnect to MySQL, using the name of your Cloud SQL instance:

$ gcloud beta sql connect <instancename> --user=root

5) See what databases exist on your MySQL server already.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

6) Create a new database for the Lahman data.

mysql> CREATE DATABASE lahman2016;

7) See that your new database is there.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| lahman2016         |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)

8) Connect to the new database.

mysql> use lahman2016;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

9) Import the Lahman SQL file you downloaded. This command loads the database and prints a long list of output as it adds the data.

mysql> source lahman2016.sql

10) Examine the database.

mysql> SHOW TABLES;
+----------------------+
| Tables_in_lahman2016 |
+----------------------+
| AllstarFull          |
| Appearances          |
| AwardsManagers       |
| AwardsPlayers        |
| AwardsShareManagers  |
| AwardsSharePlayers   |
| Batting              |
| BattingPost          |
| CollegePlaying       |
| Fielding             |
| FieldingOF           |
| FieldingOFsplit      |
| FieldingPost         |
| HallOfFame           |
| HomeGames            |
| Managers             |
| ManagersHalf         |
| Master               |
| Parks                |
| Pitching             |
| PitchingPost         |
| Salaries             |
| Schools              |
| SeriesPost           |
| Teams                |
| TeamsFranchises      |
| TeamsHalf            |
+----------------------+
27 rows in set (0.00 sec)

mysql> SHOW COLUMNS FROM Batting;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| playerID | varchar(255) | YES  |     | NULL    |       |
| yearID   | int(11)      | YES  |     | NULL    |       |
| stint    | int(11)      | YES  |     | NULL    |       |
| teamID   | varchar(255) | YES  |     | NULL    |       |
| lgID     | varchar(255) | YES  |     | NULL    |       |
| G        | int(11)      | YES  |     | NULL    |       |
| AB       | int(11)      | YES  |     | NULL    |       |
| R        | int(11)      | YES  |     | NULL    |       |
| H        | int(11)      | YES  |     | NULL    |       |
| 2B       | int(11)      | YES  |     | NULL    |       |
| 3B       | int(11)      | YES  |     | NULL    |       |
| HR       | int(11)      | YES  |     | NULL    |       |
| RBI      | int(11)      | YES  |     | NULL    |       |
| SB       | int(11)      | YES  |     | NULL    |       |
| CS       | int(11)      | YES  |     | NULL    |       |
| BB       | int(11)      | YES  |     | NULL    |       |
| SO       | int(11)      | YES  |     | NULL    |       |
| IBB      | varchar(255) | YES  |     | NULL    |       |
| HBP      | varchar(255) | YES  |     | NULL    |       |
| SH       | varchar(255) | YES  |     | NULL    |       |
| SF       | varchar(255) | YES  |     | NULL    |       |
| GIDP     | varchar(255) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
22 rows in set (0.01 sec)
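Once the tables are in, the fun part is querying them. Here is a rough sketch of the kind of aggregate query I have in mind, using the playerID, yearID, and HR columns from the Batting table above. To keep it runnable anywhere, it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the Cloud SQL MySQL instance, and the three rows are made-up placeholders, not real Lahman data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL instance (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER)")

# Placeholder rows, NOT real Lahman data
toy_rows = [
    ("player_a", 2015, 10),
    ("player_a", 2016, 12),
    ("player_b", 2016, 30),
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?)", toy_rows)

# Total home runs per player, highest first
query = """
    SELECT playerID, SUM(HR) AS career_hr
    FROM Batting
    GROUP BY playerID
    ORDER BY career_hr DESC
"""
career_hr = conn.execute(query).fetchall()
print(career_hr)  # [('player_b', 30), ('player_a', 22)]
```

The SELECT statement itself should work unchanged at the mysql> prompt against the real Batting table.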

Encapsulation: getters, setters, public, private, and properties

This week in the Algorithms and Data Structures study group, we're going to briefly talk about encapsulation in Python (this post), implement a stack and a queue building on the linked list class (resources here), and then implement some more methods in the linked list, like mean and standard deviation (post to come after the meeting).

Here's a Python.com tutorial that is a nice read to get started. All of the following code can be found in this Jupyter notebook.

Python Style

A fundamental idea in object oriented programming is encapsulation - the idea that the attributes of a class should almost always be private.

In the Algorithms and Data Structures for GIScientists study group, we recently implemented a linked list class in Python following the lead from the video here. The video was great, and served our purposes well by getting everyone on the same page about linked lists, but it raised a lot of questions in the group about OOP in Python versus Java.

To follow up on this, here are some videos that we will watch and discuss.

Public and Private Variables

The following is from a post that is worth reading. Coding is cultural, and this post unpacks the mechanics that underlie some of the cultural preferences between different languages, with a specific focus on Python.

"Some people teach that _x is Python's equivalent of protected, and __x its equivalent of private, but that's very misleading.

The single underscore has only a conventional meaning: don't count on this being part of the useful and/or stable interface. Many introspection tools (e.g., tab completion in the interactive interface) will skip over underscore-prefixed names by default, but nothing stops a consumer from writing spam._eggs to access the value.

The double underscore mangles the name—inside your own methods, the attribute is named __x, but from anywhere else, it's named _MyClass__x. But this is not there to add any more protection—after all, _MyClass__x will still show up in dir(my_instance), and someone can still write my_instance._MyClass__x = 42. What it's there for is to prevent subclasses from accidentally shadowing your attributes or methods. (This is primarily important when the base classes and subclasses are implemented independently—you wouldn't want to add a new _spam attribute to your library and accidentally break any app that subclasses your library and adds a _spam attribute.)"
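To see the mangling from the quote in action, here is a small sketch. The Spam and SubSpam classes and the eggs attributes are made up for illustration, borrowing the names from the quote.

```python
class Spam(object):
    def __init__(self):
        self.eggs = 1      # public
        self._eggs = 2     # "internal" by convention only
        self.__eggs = 3    # stored on the instance as _Spam__eggs

    def get_private(self):
        # inside the class, the short name __eggs still works
        return self.__eggs


class SubSpam(Spam):
    def __init__(self):
        Spam.__init__(self)
        self.__eggs = 99   # stored as _SubSpam__eggs, so it does not shadow Spam's


spam = Spam()
sub = SubSpam()

print(spam._eggs)               # 2: nothing stops you
print(spam._Spam__eggs)         # 3: mangled, but still reachable
print(hasattr(spam, "__eggs"))  # False: the unmangled name does not exist
print(sub.get_private())        # 3: the subclass did not clobber the base value
```

The last line is the point of the double underscore: the subclass can reuse the same attribute name without breaking the base class.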

Property Decorators

This site provides a very nice consideration of public and private variables, and the use of @property. An excerpt:

"Getters and setters are used in many object oriented programming languages to ensure the principle of data encapsulation. They are known as mutator methods as well. ... These methods are of course the getter for retrieving the data and the setter for changing the data. According to this principle, the attributes of a class are made private to hide and protect them from other code."

Some General Thoughts on Style in Python

To start off with, in Java, encapsulation is implemented in a class by making all variables private and then using getters and setters. This is how I learned OOP.

Following is an example from last week's node class written in Python, but in a Java-like way:


class Node(object):

    def __init__(self, d, n = None):
        self.data = d
        self.next_node = n
        
    def get_next(self):
        return self.next_node
    
    def set_next(self, n):
        self.next_node = n
        
    def get_data(self):
        return self.data
    
    def set_data(self, d):
        self.data = d
    
    def __str__(self):
        # __str__ must return a string rather than print one
        return "{0} -> {1}".format(self.data, self.next_node)

In the above example, the getters and setters are redundant. In your code, you can read or change the data and next_node directly by typing <nameOfObject>.data or <nameOfObject>.next_node.

As you can see, the class is written in a Java style and works just fine. The problem is that it doesn't follow Python convention. For reference, PEP8 makes three references to properties in its discussion of inheritance.

The following is an example of making the attributes of the node class private, but note, this still isn't best practice. It just makes the getters and setters seem less redundant.

Node Class Private


class NodePrivate(object):

    def __init__(self, d, n = None):
        self.__data = d
        self.__next_node = n
        
    def get_next(self):
        return self.__next_node
    
    def set_next(self, n):
        self.__next_node = n
        
    def get_data(self):
        return self.__data
    
    def set_data(self, d):
        self.__data = d
    
    def __str__(self):
        # __str__ must return a string rather than print one
        return "{0} -> {1}".format(self.__data, self.__next_node)

This is what the node class would look like with properties.


class NodeProperties(object):
    
    def __init__(self, d, n = None):

        self.__data = d
        self.__next_node = n
    
    @property
    def data(self):
        return self.__data
        
    @property
    def next_node(self):
        return self.__next_node
        
    @next_node.setter
    def next_node(self, n):
        self.__next_node = n
        
    @data.setter
    def data(self, d):
        self.__data = d
    
    def __str__(self):
        # __str__ must return a string rather than print one
        return "{0} -> {1}".format(self.__data, self.__next_node)

But according to the interwebs, it's best to not use them at all, which would make the class look like this.


class NodePublic(object):
    
    def __init__(self, d, n = None):
        self.data = d
        self.next_node = n
        
    def __str__(self):
        # __str__ must return a string rather than print one
        return "{0} -> {1}".format(self.data, self.next_node)

I think we can all agree, the totally public version is really good looking.

Yes you do, my friend, yes you do. 

All of the above code can be found in this Jupyter notebook.

Last of all, the real power of Python properties is that you can hack out the first node class with public attributes, let everyone use it, and then later, when you realize you want to change some things, use properties to encapsulate the public data so that variable names don't have to change in the rest of the system. Relying on this probably isn't the best design principle. According to PEP8, the general rule still seems to be: if in doubt, make it private.
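Here is a rough sketch of that retrofit. It assumes a later version of the node class where we decide, hypothetically, that data should never be None. The check is my own invention for illustration; the point is that code written against the public version keeps using node.data unchanged.

```python
class Node(object):
    """Later version of the public node class; callers still use .data."""

    def __init__(self, d, n=None):
        self.data = d          # goes through the setter below
        self.next_node = n

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, d):
        # hypothetical rule added after the fact: data must not be None
        if d is None:
            raise ValueError("data must not be None")
        self._data = d


node = Node(5)
node.data = 7        # same syntax as with a plain public attribute
print(node.data)     # 7
```

Trying Node(None) or node.data = None now raises a ValueError, and nobody had to rename anything to get_data or set_data.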

Resources on Stacks and Queues

Here is the Python 2 documentation for lists, which has the methods needed for stacks and queues. For a nice visual overview of these data structures, see this blog post. Following is a video that explains stacks and queues, chosen because Damien Gordon is by far the most dignified.

If you remain uninterested in implementing this data structure, here's a video showing the syntax with the list.
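For reference, here is a minimal sketch of that syntax. The stack uses a plain list. For the queue I swapped in collections.deque from the standard library, since removing from the front of a plain list with pop(0) works but has to shift every remaining element.

```python
from collections import deque

# Stack: last in, first out
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
top = stack.pop()          # removes and returns "c"

# Queue: first in, first out
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first = queue.popleft()    # removes and returns "a"

print(top, stack)          # c ['a', 'b']
print(first, list(queue))  # a ['b', 'c']
```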

Resources for setting up Jupyter and linked lists

The following resources are for the first meeting of the algorithms and data structures study group. For a full list of the study group activities, see the main blog post.

Setting up Jupyter

  • Steps to install Jupyter notebook on your personal computer
  • Using the anaconda version is really easy because it installs all of the scientific computing libraries you need with it

Resources for Linked Lists

Videos

I selected the following video as our main prompt for two reasons: 1) the visuals are clear, 2) there is an example of a linked list being implemented in Python, and - I lied, there are three reasons - 3) the video maker's name is Joe James.

Other videos:

  • coursera video (which also summarizes an abstract data type well)
  • youtube query - if you find one you like better, let me know and I'll add it to this post. If you make your own, let me know, I'll add it too.

Readings

Spark Notes

A linked list is a data structure. Unlike an array that has to have one block of memory that isn't cut up, a linked list contains pointers from each piece of data to the next in memory so it can be scattered around. The wikipedia link has a really nice overview.

Potentially the most important concept to get here (for going forward at a conceptual level) is the "node." Each node consists of one data element and a pointer to the next data element. The node is a fundamental concept because it is built on to get to trees and graphs. Both abstract data types are important for geographic information science. Trees are used across various data access methods (see R-Tree for example), and graphs and their respective algorithms are the basis of all road networks, among other things.

The diagram below shows how a linked list stores data and a pointer to the next piece of data.

from Wikimedia Commons

Note: the developers (Newell, Shaw, Simon) were all big time names in CS and artificial intelligence. Simon in particular is widely known across the cognitive sciences and won a Nobel Prize in Economics.

[02-15-17] Meeting Recap

More resources we discovered:

  • Python documentation on classes
  • Very clearly laid out blog post from Code Fellows (thanks Josh!)
  • Code Fellows also referenced the book Problem Solving with Algorithms and Data Structures by Miller and Ranum

We met in Social Sciences 423 for a little over an hour. We watched the above video, and then implemented the node class, broke it, and discussed it. Then we started implementing the list class and started breaking it, but ran out of time to discuss it.

In addition to discussing the data structure, we also talked about general object oriented programming skills. Some of the things that we discussed were

  1. Classes (documentation)
  2. Constructors in classes >> def __init__(self, x, y)
  3. Why the linked list adds new nodes to beginning and not the end (From Josh: "hint: much more efficient... which is what our discussion was getting at")
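To make Josh's hint concrete, here is a small sketch that counts the work involved. It assumes a list that only keeps a pointer to its root (no tail pointer), which matches what we built. The add_front and add_end functions are my own names for illustration.

```python
class Node(object):
    def __init__(self, d, n=None):
        self.data = d
        self.next_node = n


def add_front(root, d):
    """Prepend: one step no matter how long the list is. Returns the new root."""
    return Node(d, root)


def add_end(root, d):
    """Append without a tail pointer. Returns (root, number of nodes visited)."""
    new_node = Node(d)
    if root is None:
        return new_node, 0
    steps = 1                      # we visit the root first
    this_node = root
    while this_node.next_node is not None:
        this_node = this_node.next_node
        steps += 1
    this_node.next_node = new_node
    return root, steps


root = None
for value in range(1000):
    root = add_front(root, value)  # constant work each time

root, steps = add_end(root, -1)
print(steps)                       # 1000: every existing node had to be visited
```

Adding to the front is the same amount of work for a list of ten nodes or ten million, while adding to the end this way gets slower as the list grows.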

[02-23-17] Meeting Recap

We met in the Brown Room in Social Sciences. We finished implementing the linked list from scratch and then started to implement the integrate function.

Implementation in Python

Jupyter notebook implementation. The following provides the node class and the linked list class

Node Class

The following node class is an object that holds (1) data and (2) a pointer to the next node.

class Node(object):
    
    ## Constructor: see [1] 
    ## specifies the components of an object 
    ## (the stuff to put in a box)
    
    def __init__(self, d, n = None):
        ## single underscore v double underscore in Python
        ## single = internal by convention only
        ## double = name-mangled to _Node__x to avoid subclass clashes
        ## (neither is truly private)
        self.data = d
        self.next_node = n
        
    def get_next(self):
        ## great stackoverflow explaining use of "self" [2]
        return self.next_node
    
    def set_next(self, n):
        self.next_node = n
        
    def get_data(self):
        return self.data
    
    def set_data(self, d):
        self.data = d
    
    def __str__(self):
        ## __str__ must return a string rather than print one
        return "{0} -> {1}".format(self.data, self.next_node)

##[1] https://www.tutorialspoint.com/python/python_classes_objects.htm
##[2] http://stackoverflow.com/questions/68282/why-do-you-need-explicitly-have-the-self-argument-into-a-python-method

Linked List

The linked list uses the node class, and connects a group of nodes with methods.

class LinkedList(object):
    
    def __init__(self, r = None):
        self.root = r
        self.size = 0
        
    def get_size(self):
        return self.size
    
    def get_root(self):
        return self.root
    
    def add(self, d): 
        ## the add function takes in data;
        ## it creates a new node holding this data,
        ## whose next node is the current root
        new_node = Node(d, self.root)
        self.root = new_node
        self.size += 1
        ## so the new node is added to the beginning of the list
        ## and becomes the new root;
        ## then the list's size is incremented by 1
        
    def remove(self, d):
        ## start by setting the root node as the first placeholder
        ## set the previous node to None because the root has no predecessor
        this_node = self.root
        prev_node = None
        
        ## while this_node is not None
        ## evaluate this_node's data; if it matches
        ## and there is a previous node, set the previous node's
        ## next node to this_node's next node
        ## otherwise, the match is the root itself, so the node
        ## after the root becomes the new root
        while this_node:
            if this_node.get_data() == d:
                if prev_node:
                    prev_node.set_next(this_node.get_next())
                else:
                    self.root = this_node.get_next()
                    
                ## if this conditional was met, then the size
                ## will be one less and an object was found; return True
                self.size -= 1
                return True
            else:
                ## if the previous conditional wasn't met, grab the next node
                ## and repeat the process
                prev_node = this_node
                this_node = this_node.get_next()
        ## if the while loop ends with this_node = None,
        ## then the data element did not exist in the list
        return False
    
    def find(self, d):
        this_node = self.root
        while this_node:
            if this_node.get_data() == d:
                ## if it matches, return data
                return d
            else:
                ## if nothing found, move onto the next node
                this_node = this_node.get_next()
        return None
    
    
    def reverse(self):
        #reverses the order of the linked list
        this_node = self.root
        reverse = LinkedList()
        while this_node:
            #starts with last node added (adds to front of list)
            reverse.add(this_node.get_data())
            this_node = this_node.get_next()
            
        return reverse
    
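    def integrate(self):
        ## returns the "integral" (running total) of the linked list;
        ## add() puts each total on the front, so the totals come back
        ## in reverse order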
        this_node = self.root
        integrated = LinkedList()
        value = 0
        while this_node:
            value = value + this_node.get_data()
            integrated.add(value)
            this_node = this_node.get_next()
            
        return integrated
                
    def print_data(self):
        this_node = self.root
        while this_node:
            print(this_node.get_data())
            this_node = this_node.get_next()
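
To show how the pieces behave together, here is a condensed, self-contained variant of the two classes with a short usage example. It accesses the attributes directly instead of through getters, and to_list is a hypothetical helper I added so the contents are easy to print. It is not part of the study group's class. The expected output in the comments assumes the code is run as written.

```python
class Node(object):
    def __init__(self, d, n=None):
        self.data = d
        self.next_node = n


class LinkedList(object):
    def __init__(self):
        self.root = None
        self.size = 0

    def add(self, d):
        # new nodes go on the front and become the root
        self.root = Node(d, self.root)
        self.size += 1

    def remove(self, d):
        this_node, prev_node = self.root, None
        while this_node:
            if this_node.data == d:
                if prev_node:
                    prev_node.next_node = this_node.next_node
                else:
                    self.root = this_node.next_node
                self.size -= 1
                return True
            prev_node, this_node = this_node, this_node.next_node
        return False

    def integrate(self):
        # running total, walking from the root
        integrated, value = LinkedList(), 0
        this_node = self.root
        while this_node:
            value += this_node.data
            integrated.add(value)
            this_node = this_node.next_node
        return integrated

    def to_list(self):
        # hypothetical helper for inspection only
        out, this_node = [], self.root
        while this_node:
            out.append(this_node.data)
            this_node = this_node.next_node
        return out


numbers = LinkedList()
for value in (1, 2, 3):
    numbers.add(value)

print(numbers.to_list())              # [3, 2, 1]
print(numbers.integrate().to_list())  # [6, 5, 3]
print(numbers.remove(3))              # True
print(numbers.to_list())              # [2, 1]
```

Note that the last value added ends up first, and that the running totals come back newest first for the same reason.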