Cleaning data (part 2)

Slide 1: Cleaning data (part 2)

Advanced cleaning with OpenRefine link for slides: http://bit.ly/bu-cleaning-data-p2

Slide 2: First steps:

Make sure you’ve downloaded and installed OpenRefine: http://openrefine.org/download.html

1) Go to the Downloads page 2) Select the appropriate application for your computer

See more information on downloading the OpenRefine software here: https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions

Slide 3: What we’ll cover

First things first (software installation check)
Web technologies and non-rectangular data types
Wrectangling your data
Make it reproducible
Share your work!

Many times, the data we encounter on the web (and other places) is in a format we can’t directly use, so we need to make some changes to the structure to make it more useful. The internet is full of all kinds of data types, and we want to be able to access, clean, format, restructure, visualize, and analyze all of it. OpenRefine is a tool that helps us take in a variety of data formats and clean it into something we can use.

Slide 4: So much data work to do…

Today, we capture and store a massive amount of data electronically. But often, the format and structure for capturing and measuring data are not ideal for creating visualizations or performing analyses. This situation means we need to clean data (prepare them) for analysis.

Slide 5: What is/where are data?

Data are everywhere we look, and they’re being used to measure nearly anything we can imagine

Numbers, text, pictures, audio, videos–there’s no limit to what we can capture.

Data get collected from apps, devices, sensors, scanned documents, the internet, and relational databases. Storage like this usually results in a variety of file formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.

Slide 5: But cleaning data isn’t sexy

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Slide 6: Fundamentals – File Types and Structure

As we’ve noted, the information we find on the web (and other places) is in a format we can’t directly use. In the next few slides, we’re going to present what these formats are, why people use them, and how we can convert them into something we can analyze.

Slide 8: Data come in a variety of file formats

Someone sends you data in a downloadable file

You log into an interactive front-end and retrieve the data from a storage system (i.e., an MS SQL server database system)

The data are available via a continuous stream that’s capturing web traffic (i.e., social media)

You access the data through an Application Programming Interface (API)

Each of these methods might create a different file type, so it’s essential to know the strength and weaknesses of each. A single file that gets emailed to you is handy because you can always return to the original email to retrieve it. However, more extensive data sources require you to interact with them via a virtual machine or desktop (VMware or VirtualBox). Sometimes there is a constant stream of data coming in from users, like on a social media application or online business. Finally, sometimes, companies or organizations have a portal for accessing data called an Application Programming Interface (API).

Slide 9: Types of files: text files

Computer files generally belong to two broad groups, commonly called text files and binary files:

Files like web pages, computer source code, and open-source programming languages are all text files

These files can be opened in a text editor (Notepad, Text Edit, etc.) or via the command line

Humans and computers can read these files

These files consist of only text characters (UTF-8 and UTF-16)

Editing them can be difficult without an understanding of the particular format, language, or structure of the text itself.

Text files are our friends for lots of reasons, but the most important being that they are more accessible to our collaborators (and future selves!).

Slide 10: Types of files: binary files

Binary files require software to open them and are typically not human-readable

Files like MS Word, Excel, and other proprietary software files

Executable files like application installation files (.dmgs or .exe)

Media files like .png, .jpg .mp4 or .mov files

Encryption or compression files (.zip or .rar)

Humans can’t read these files

These files need software to open and edit, and usually can’t be viewed via the command line. Most media files, database files, executables, and zipped files are binary files.
You can view plain text files, but you probably need to know the language if you’re going to edit or create these files.

Slide 11: Rectangular files (spreadsheets)

These are files stored in columns and rows (or variables and observations)

Rectangular files usually require spreadsheet software (Excel or OpenOffice) or relational database software.

Understanding how to access, manipulate, and extract the information in rectangular files (i.e., spreadsheets) is an essential part of data work because of how ubiquitous tabular data are. An absolute must-read for data in spreadsheets is “Data Organization in Spreadsheets” by Karl Broman and Kara Woo.

Slide 12: Non-rectangular files (JSON & XML)

Non-rectangular data files (like JSON and XML) are commonly used on the web for storing data.

JSON was (JavaScript Object Notation) created in 2002, and used for data storage and transfer

XML (extensible markup language) is slightly older technology (created in 1996), but still widely used for the same purpose

Internet technologies commonly use non-rectangular data files (like JSON and XML). JSON (pronounced Jay-SON) is the JavaScript Object Notation format, and it looks a little like a list (if you’ve seen these before). XML (or extensible markup language) is a data description language that’s useful for sharing information between different software systems.

Slide 13: JSON objects

JSON is an object notation language and stores data in objects and arrays

Why would anyone store data this way? JSON can store data and the attributes of the data within the same object.

JSON data is a general-purpose, lightweight format, widely accessible, and relatively simple. You’ll find JSON used directly in JavaScript code for Web pages, and it’s the file format of choice for many APIs.
JSON Data Types: null, true, false, number, string
JSON Data Containers: square brackets [ ], curly brackets { }
JSON objects (in contrast to rectangular database models) can store the set of attributes for an object within the object. JSON objects have a flexible representation because they don’t confine your data to columns and rows (now you can include text, images, etc.).
However, these objects typically need to be coerced into a rectangular object before analyzing.

Slide 14: XML objects

XML (eXtensible Markup Language) is a language used to encode web documents and data structures
XML is also useful for transmitting information between different software systems and through APIs
XML was designed to be “self-descriptive” and carry data in a readable format

XML is also useful for transmitting information between different software systems and through APIs. You can read more about the XML standard here on W3Schools.

This is an example of a complete XML document:

<?xml version="1.0"?> 
<temperatures>
      <filename>ISCCPMonthly_avg.nc</filename>
      <case date="16-JAN-1994"
            temperature="278.9"/> 
</temperatures>

‘Readable’ here means by humans and machines, and the difference between XML and HTML is that XML is used to transport information, while HTML is used to display information.

Slide 15: Why bother with these technologies?

These are weird formats–why do we need to spend all this time learning how to use them?

Slide 16: API: Application Programming Interface

An API is the set of instructions for accessing or transmitting information between software applications
JSON and XML are usually how data are transferred in/out of APIs

Example: I Know Where Your Cat Lives.

From about,

“I Know Where Your Cat Lives iknowwhereyourcatlives.com is a data visualization experiment that locates a sample of one million public images of cats on a world map by the latitude and longitude coordinates embedded in their metadata. The cats were accessed via publicly available APIs provided by popular photo sharing websites. The photos were then run through various clustering algorithms using a supercomputer in order to represent the enormity of the data source.”

Slide 17: API: Application Programming Interface

1) Photos from Flickr and Instagram get stored in an API

2) Requests are made for certain photos with GPS data

I know Where Your Cat Lives is a project that uses the APIs of Flickr and Instagram to access certain photos (i.e. photos of cats, or a specific landmark) with GPS data (in something called an Exif record).

Slide 18: API: Application Programming Interface

3) Combine image meta data with GPS data from OpenStreetMap API

*Example: I Know Where Your Cat Lives *

I know Where Your Cat Lives is a project that uses the APIs of Flickr and Instagram to access certain photos (i.e. photos of cats, or a specific landmark) with GPS data (in something called an Exif record).

GPS data is imported via a map application API, then these photos are placed on a map according to the metadata stored in their photo.

Slide 19: API: Application Programming Interface

4) Use Google Maps to display the image on the satellite image

*Example: I Know Where Your Cat Lives *

I know Where Your Cat Lives is a project that uses the APIs of Flickr and Instagram to access certain photos (i.e. photos of cats, or a specific landmark) with GPS data (in something called an Exif record).

GPS data is imported via a map application API, then these photos are placed on a map according to the metadata stored in their photo.

These images and maps are then loaded into an application that load the two at the same time.

Slide 20: A Quick Example

Slide 21: Load Address Data into OpenRefine

Navigate to this link and load the addresses into OpenRefine:

Raw CSV address data

Create a new project called us-addresses

APIs let you combine data from multiple sources, and there really is no limit to the amount of available API data. Just to list a few:

We will be using the OpenMapQuest API to access longitude and latitude information from a .csv file of 234 addresses in Massachusetts and New York.

These addresses are in two columns, street and city_state_zip.

Slide 22: Create full_address column

Use the following GREL code to create a new full_address column:

value + ", " + cells["city_state_zip"].value

Click OK.

Select the little arrow on the street column and click on,

Edit column > Add column based on this column

Here we’re going to use the General Refine Expression Language (GREL), but OpenRefine also allows for Jython (Python implemented in Java), and Clojure (a functional language that resembles Lisp). There are quite a few resources out there for GREL expressions, but I like the one from Illinois Library. We’re going to use the GREL language to prepare the address columns for mapping.

Select the little arrow on the street column and click on Edit column >> Add column based on this column.

In the dialogue box under ‘New column name’, enter the name of the new column (full_address).

In the section for On error, select store error. Under the section for the Expression, enter the GREL expression below:

value + ", " + cells["city_state_zip"].value

This will give us a new column with street, city, state, and zip.

Slide 23: Arrange columns

Select the little arrow on the new full_address column and click on,

Edit column > Move column to the end

Just to make things a little clearer, we’ll move the new column to the end. Moving columns around gives us the ability to view columns side-by-side, and provides some visible reassurance about the data transformations we’re making.

Slide 24: MapQuest API

Head over to:

https://developer.mapquest.com/

Click on the button to get your Free API Key

Head over to MapQuest and get a Free API Key.

The instructions are fairly straightforward, but you will need to get into the developer dashboard.

Slide 25: Create and Approve MapQuest API Key

You can create a new key or use the one listed under My Application

Click Manage Keys then Approve All Keys

Head over to MapQuest and get a Free API Key.

The instructions are fairly straightforward, but you will need to get into the developer dashboard, and your keys will be under Manage Keys

You can create a new application, or use the one given under My Application. Read more about API keys here.

Slide 26: APIs with OpenRefine

OpenRefine works with GET request APIs.

Slide 27: Example MapQuest API query

Learning the components of API requests is an important part of accessing and gathering data. Most sites have at least a little documentation about how to access the data available on their API. When they don’t, fortunately all requests have similar components and syntax.

The parts of an API request are in the figure above. You can also read more on the documentation website here.

Slide 28: Example API query parameters with GREL

Click on the new full_address column and select, Add column by fetching URLs

Follow the steps in the diagram to the right, but refer to the notes below for the GREL expression:

After you’ve entered the Expression, check the 1st row in the Preview section at the bottom of the dialogue box.

Now we will create another column using OpenRefine’s data enrichment abilities. OpenRefine gives us the ability to load various data APIs into our existing projects.

New column name = mapquest_locations

On error = store error

Throttle delay = 1000 milliseconds

Expression = 'http://open.mapquestapi.com/nominatim/v1/search.php?'+ 'format=json&' + 'key=JWbcrLmu5UfJ9n1krjAMdr0jz3QIG0ha&'+ 'q='+ escape(value, "url")

Slide 29: Example API query parameters with GREL

This takes time!

It took me a few minutes to run this API request, so expect to wait a bit for the data to load into OpenRefine.

Slide 30: JSON data

When you’re done, you should see the following JSON data in the new mapquest_locations column

After OpenRefine is finished loading the data into mapquest_locations, you’ll see the JSON data is all crammed into a single row. This makes it very hard to see what’s in this new column, so we’ll need to parse the JSON data into something a little easier to work with.

Slide 31: Parsing JSON data in OpenRefine

Click on the new mapquest_locations column and select Edit column > Add column based on this column…

In the dialogue box, enter the following options:

Name new column = latitude
On error = store error
Expression = value.parseJson()[0].lat
Verify this is correct in the Preview pane

Parsing JSON data in OpenRefine is made simpler using GREL’s parseJson() function. We can enter the following expression into the Expression box,

value.parseJson()[0].lat

Then we name our new column (latitude), and select store error (this is a good practice in case there was a typo or anything that needs to be debugged). We also want to check out the contents in the Preview pane at the bottom of the dialogue window (make sure it looks like latitude data).

Repeat the same steps on the mapquest_locations column, but this time create a longitude column

Slide 32: Parsing JSON data in OpenRefine

Repeating the process is exactly the same as the process above, but I’m going to change lat to lon

value.parseJson()[0].lon

Check to make sure this code is working, then click “OK”.

Slide 33: Save each step as a JSON file

Click on the Undo/ Redo icon next to Facet / Filter

Identify the steps needed to create this data set (6 of them)

I want to make sure and save all the steps I just took to create these new columns, so I’m going to use OpenRefine’s Undo/Redo section.

Slide 34: Save each step as a JSON file

I want to make sure and save all the steps I just took to create these new columns, so I’m going to use OpenRefine’s Undo/Redo section.

Make sure to save the contents to a file that you can change the extension to “.json”

Store this in the same folder as the “us-addresses.csv” file with some description.

Click on the Extract button (next to Apply)
Make sure all the check the boxes!

Copy + Paste this into a file and save for the next project!

[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "street",
    "expression": "grel:value + \", \" + cells[\"city_state_zip\"].value",
    "onError": "store-error",
    "newColumnName": "full_address",
    "columnInsertIndex": 1,
    "description": "Create column full_address at index 1 based on column street using expression grel:value + \", \" + cells[\"city_state_zip\"].value"
  }
]

Slide 35: Check out these links to learn more!

John R Little has a great Pragmatic Datafication workshop with some great materials, see the slides and workbook.
Check out the OpenRefine GREL documentation on Github (lots of great tidbits in here!)
Good ’ol fashioned YouTube! Check out the following video to learn more:

Check out these slides and the github page for more information!

cleaning-data-part2

Advanced workshop for cleaning data in OpenRefine

Cleaning data (part 2)

Slide 1: Cleaning data (part 2)

Slide 2: First steps:

Slide 3: What we’ll cover

Slide 4: So much data work to do…

Slide 5: What is/where are data?

Slide 5: But cleaning data isn’t sexy

Slide 6: Fundamentals – File Types and Structure

Slide 8: Data come in a variety of file formats

Slide 9: Types of files: text files

Slide 10: Types of files: binary files

Slide 11: Rectangular files (spreadsheets)

Slide 12: Non-rectangular files (JSON & XML)

Slide 13: JSON objects

Slide 14: XML objects

Slide 15: Why bother with these technologies?

Slide 16: API: Application Programming Interface

Slide 17: API: Application Programming Interface

Slide 18: API: Application Programming Interface

Slide 19: API: Application Programming Interface

Slide 20: A Quick Example

Slide 21: Load Address Data into OpenRefine

Slide 22: Create full_address column

Slide 23: Arrange columns

Slide 24: MapQuest API

Slide 25: Create and Approve MapQuest API Key

Slide 26: APIs with OpenRefine

Slide 27: Example MapQuest API query

Slide 28: Example API query parameters with GREL

Slide 29: Example API query parameters with GREL

Slide 30: JSON data

Slide 31: Parsing JSON data in OpenRefine

Slide 32: Parsing JSON data in OpenRefine

Slide 33: Save each step as a JSON file

Slide 34: Save each step as a JSON file

Slide 35: Check out these links to learn more!