Thursday, December 29, 2016
Creating a Custom WordPress Messaging System, Part 4
In this series, we've taken a look at how we can implement a system that allows us to programmatically define custom messages that display on a given administration page in the WordPress back end.
If you've followed along with the series thus far, then you know:
- We've laid the groundwork for the plugin that's used throughout this series, and even developed it a bit further.
- We've defined and used a custom hook that we can use to render the settings messages.
- We've added support for success, warning, and error messages that can be rendered at the top of a given settings page.
As mentioned in the previous tutorial:
But if you've read any of my previous tutorials, you know that I'm not a fan of having duplicated code. Nor am I a fan of having one class do many things. And, unfortunately, that's exactly what we're doing here.
And we're going to address that in this final tutorial. By the end, we'll have a complete refactored solution that uses some intermediate object-oriented principles like inheritance. We'll also have a few methods that we can use programmatically or that can be registered with the WordPress hook system.
Getting Started at the End
At this point you should know exactly what you need in your local development environment. Specifically, you should have the following:
- PHP 5.6.25 and MySQL 5.6.28
- Apache or Nginx
- WordPress 4.6.1
- Your preferred IDE or editor
I also recommend the most recent version of the source code as it will allow you to walk through all of the changes that we're going to make. If you don't have it, that's okay, but I recommend reading back over the previous tutorials before going any further.
In the Previous Tutorial
As you may recall (or have ascertained from the comment above), the previous tutorial left us with a single class that was doing too much work.
One way to know this is that if you were to describe what the class was doing, you wouldn't be able to give a single answer. Instead, you'd have to say that it was responsible for handling success messages, warning messages, error messages, and rendering all of them independently of one another.
And though you might make the case that it was "managing custom messages," you wouldn't necessarily be describing just how verbose the class was. That's what we hope to resolve in this tutorial.
In the Final Tutorial
Specifically, we're going to be looking at doing the following:
- removing the old settings messenger class
- adding a new, more generic settings message class
- adding a settings messenger class with which to communicate
- introducing methods that we can use independent of WordPress
- streamlining how WordPress renders the messages
We have our work cut out for us, so let's go ahead and get started with all of the above.
Refactoring Our Work
When it comes to refactoring our work, it helps to know exactly what it is that we want to do. In our case, we recognize that we have a lot of duplicate code that could be condensed.
Furthermore, we have three different types of messages managed in exactly the same way save for how they are rendered. And in that instance, it's an issue of the HTML class attributes.
Thus, we can generalize that code to focus on a specific type, and we can consolidate a lot of the methods for adding success messages or retrieving error messages by generalizing a method to recognize said type.
Ultimately, we will do that. But first, some housekeeping.
1. Remove the Old Settings Messenger
In the previous tutorials, we've been working with a class called Settings_Messenger. Up to this point, it has served its purpose, but we're going to be refactoring this class throughout the remainder of this tutorial.
When it comes to this type of refactoring, it's easy to want to simply delete the class and start over. There are times in which this is appropriate, but this is not one of them. Instead, we're going to take that class and refactor what's already there.
All of that to say, don't delete the file and get started with a new one. Instead, track with what we're doing throughout this tutorial.
2. A New Settings Message Class
First, let's introduce a Settings_Message class. This represents any type of settings message that we're going to write. That is, it will manage success messages, error messages, and warning messages.
To do this, we'll define the class, introduce a single property, and then we'll instantiate it in the constructor. Check out this code, and I'll explain a bit more below:
<?php

class Settings_Message {

    private $messages;

    public function __construct() {

        $this->messages = array(
            'success' => array(),
            'error'   => array(),
            'warning' => array(),
        );
    }
}

Notice that we've created a private attribute, $messages. When the class is instantiated, we create a multidimensional array. Each index, identified either by success, error, or warning, refers to its own array in which we'll be storing the corresponding messages.
Next, we need to be able to add a message, get a message, and get all of the messages. I'll discuss each of these in more detail momentarily.
Adding Messages
First, let's look at how we're adding messages:
<?php

public function add_message( $type, $message ) {

    // Sanitize the incoming string before storing it.
    $message = sanitize_text_field( $message );

    // Skip the message if this type already contains it.
    if ( in_array( $message, $this->messages[ $type ] ) ) {
        return;
    }

    array_push( $this->messages[ $type ], $message );
}

This method first takes the incoming string and sanitizes the data. Then it checks to see if the message already exists in the collection for the given type. If so, it simply returns. After all, we don't want duplicate messages.
Otherwise, it adds the message to the collection.
Getting Messages
Retrieving messages comes in two forms:
- rendering individual messages by type
- rendering the messages in the display of the administration page (complete with HTML sanitization, etc.)
Remember, there are times when we may only want to display warning messages. Other times, we may want to display all of the messages. Since there are two ways of doing this, we can implement one and then take advantage of it in another function.
Sound confusing? Hang with me and I'll explain all of it. The first part we're going to focus on is how to render messages by type (think success, error, or warning). Here's the code for doing that (and it should look familiar):
<?php

public function get_messages( $type ) {

    if ( empty( $this->messages[ $type ] ) ) {
        return;
    }

    $html  = "<div class='notice notice-$type is-dismissible'>";
    $html .= '<ul>';
    foreach ( $this->messages[ $type ] as $message ) {
        $html .= "<li>$message</li>";
    }
    $html .= '</ul>';

    // Double quotes are needed here so that $type is interpolated.
    $html .= "</div><!-- .notice-$type -->";

    $allowed_html = array(
        'div' => array(
            'class' => array(),
        ),
        'ul'  => array(),
        'li'  => array(),
    );

    echo wp_kses( $html, $allowed_html );
}

Notice here that we're using much of the same code from the previous tutorial; however, we've generalized it so that it looks at the incoming $type and dynamically applies it to the markup.
This allows us to have a single function for rendering our messages. This isn't all, though. What about the times we want to get all messages? This could be to render on a page or to grab them programmatically for some other processing.
To do this, we can introduce another function:
<?php

public function get_all_messages() {

    foreach ( $this->messages as $type => $message ) {
        $this->get_messages( $type );
    }
}

This method should be easy enough to understand. It simply loops through all of the message types we have in our collection and calls the get_messages function we outlined above for each one.
It still renders them all together (we'll see one use of this in our implementation of a custom hook momentarily). If you wanted to use them for another purpose, you could append the results to a string and return it to the caller, or perform some other programmatic function.
This is but one implementation.
3. The Settings Messenger
That does it for the actual Settings_Message class. But how do we communicate with it? Sure, we can talk to it directly, but if there's an intermediate class, we have some control over what's returned to us without adding more responsibility to the Settings_Message class, right?
Enter the Settings_Messenger. This class is responsible for allowing us to read and write settings messages. I think a case could be made that you could split this up into two classes by responsibility because it both reads and writes but, like a messenger who sends and receives, that's the purpose of this class.
The initial setup of the class is straightforward.
- The constructor creates an instance of the Settings_Message class that we can use to send and receive messages.
- It associates a method with our custom tutsplus_settings_messages hook we defined in a previous tutorial.
Take a look at the first couple of methods:
<?php

class Settings_Messenger {

    private $message;

    public function __construct() {
        $this->message = new Settings_Message();
    }

    public function init() {
        add_action( 'tutsplus_settings_messages', array( $this, 'get_all_messages' ) );
    }
}

Remember from earlier in this series, we have the hook defined in our view, which can be found in settings.php. For the sake of completeness, it's listed here:

<div class="wrap">

    <h1><?php echo esc_html( get_admin_page_title() ); ?></h1>
    <?php do_action( 'tutsplus_settings_messages' ); ?>

    <p class="description">
        We aren't actually going to display options on this page. Instead, we're going
        to use this page to demonstrate how to hook into our custom messenger.
    </p><!-- .description -->

</div><!-- .wrap -->

Notice, however, that this particular hook takes advantage of a get_all_messages method on the messenger. That method isn't listed in this tutorial; presumably it is a public method that simply delegates to the get_all_messages method of Settings_Message that we reviewed above. The hook doesn't have to use this method. Instead, it could be used to simply render success messages or call any other methods that you want to use.
Adding Messages
Creating the functions to add messages is simple, as these functions require a type and the message itself. Remember, the Settings_Message class takes care of sanitizing the information, so we can simply pass in the incoming messages.
See below where we're adding success, warning, and error messages:
<?php

public function add_success_message( $message ) {
    $this->add_message( 'success', $message );
}

public function add_warning_message( $message ) {
    $this->add_message( 'warning', $message );
}

public function add_error_message( $message ) {
    $this->add_message( 'error', $message );
}
It's easy, isn't it?
Getting Messages
Retrieving messages isn't much different except we just need to provide the type of messages we want to retrieve:
<?php

// Settings_Message::get_messages() echoes the markup itself and returns
// nothing, so no additional echo is needed in these wrappers.
public function get_success_messages() {
    $this->get_messages( 'success' );
}

public function get_warning_messages() {
    $this->get_messages( 'warning' );
}

public function get_error_messages() {
    $this->get_messages( 'error' );
}
Done and done, right?
But Did You Catch That?
Notice that the methods above all refer to two other methods we haven't actually covered yet. These are private methods that help us simplify the calls above.
Check out the following private methods, which are responsible for adding and retrieving messages straight from the Settings_Message instance maintained on the messenger object:

<?php

private function add_message( $type, $message ) {
    $this->message->add_message( $type, $message );
}

private function get_messages( $type ) {
    return $this->message->get_messages( $type );
}

And that wraps up the new Settings_Messenger class. All of this is much simpler, isn't it?
Starting the Plugin
It does raise the question, though: How do we start the plugin now that we've had all of these changes?
See the entire function below:
<?php

add_action( 'plugins_loaded', 'tutsplus_custom_messaging_start' );
/**
 * Starts the plugin.
 *
 * @since 1.0.0
 */
function tutsplus_custom_messaging_start() {

    $plugin = new Submenu( new Submenu_Page() );
    $plugin->init();

    $messenger = new Settings_Messenger();
    $messenger->init();

    $messenger->add_success_message( 'Nice shot kid, that was one in a million!' );
    $messenger->add_warning_message( 'Do not go gently into that good night.' );
    $messenger->add_error_message( 'Danger Will Robinson.' );
}
And that's it.
A few points to note:
- If you don't call init on the Settings_Messenger, then no messages will be displayed on your settings page.
- The code adds messages to the Settings_Messenger, but it doesn't explicitly retrieve any, because the init method registers the hook that renders them.
- If you want to retrieve the messages yourself, then you can use the methods we've outlined above.
That's all for the refactoring. This won't work exactly out of the box as there is still some code needed to load all of the PHP files required to get the plugin working; however, the code above focuses on the refactoring which is the point of this entire tutorial.
Conclusion
For a full working version of this tutorial and complete source code that does work out of the box, please download the source code attached to this post on the right sidebar.
I hope that over the course of this material you picked up a number of new skills and ways to approach WordPress development. When looking over the series, we've covered a lot:
- custom menus
- introducing administration pages
- the various message types
- defining and leveraging custom hooks
- and refactoring object-oriented code
As usual, I'm also always happy to answer questions via the comments, and you can also check out my blog and follow me on Twitter. I usually talk all about software development within WordPress and tangential topics, as well. If you're interested in more WordPress development, don't forget to check out my previous series and tutorials, and the other WordPress material we have here on Envato Tuts+.
Resources
- Creating Custom Administration Pages with WordPress
- The WordPress Settings API
- How to Get Started With WordPress
- add_action
- do_action
- wp_kses
- sanitize_text_field
Monday, December 26, 2016
Uploading Files With Rails and Shrine
There are many file uploading gems out there like CarrierWave, Paperclip, and Dragonfly, to name a few. They all have their specifics, and probably you've already used at least one of these gems.
Today, however, I want to introduce a relatively new, but very cool solution called Shrine, created by Janko Marohnić. In contrast to some other similar gems, it has a modular approach, meaning that every feature is packed as a module (or plugin in Shrine's terminology). Want to support validations? Add a plugin. Wish to do some file processing? Add a plugin! I really love this approach as it allows you to easily control which features will be available for which model.
In this article I am going to show you how to:
- integrate Shrine into a Rails application
- configure it (globally and per-model)
- add the ability to upload files
- process files
- add validation rules
- store additional metadata and employ file cloud storage with Amazon S3
The source code for this article is available on GitHub.
The working demo can be found here.
Integrating Shrine
To start off, create a new Rails application without the default testing suite:
rails new FileGuru -T
I will be using Rails 5 for this demo, but most of the concepts apply to versions 3 and 4 as well.
Drop the Shrine gem into your Gemfile:
gem "shrine"
Then run:
bundle install
Now we will require a model that I am going to call Photo. Shrine stores all file-related information in a special text column ending with a _data suffix. Create and apply the corresponding migration:
rails g model Photo title:string image_data:text
rails db:migrate
Note that for older versions of Rails, the latter command should be:
rake db:migrate
Configuration options for Shrine can be set both globally and per-model. Global settings are done, of course, inside the initializer file. There I am going to hook up the necessary files and plugins. Plugins are used in Shrine to extract pieces of functionality into separate modules, giving you full control of all the available features. For example, there are plugins for validation, image processing, caching attachments, and more.
For now, let's add two plugins: one to support ActiveRecord and another one to set up logging. They are going to be included globally. Also, set up file system storage:
config/initializers/shrine.rb
require "shrine"
require "shrine/storage/file_system"

Shrine.plugin :activerecord
Shrine.plugin :logging, logger: Rails.logger

Shrine.storages = {
  cache: Shrine::Storage::FileSystem.new("public", prefix: "uploads/cache"),
  store: Shrine::Storage::FileSystem.new("public", prefix: "uploads/store"),
}
Logger will simply output some debugging information inside the console for you saying how much time was spent to process a file. This can come in handy.
2015-10-09T20:06:06.676Z #25602: STORE[cache] ImageUploader[:avatar] User[29543] 1 file (0.1s)
2015-10-09T20:06:06.854Z #25602: PROCESS[store]: ImageUploader[:avatar] User[29543] 1-3 files (0.22s)
2015-10-09T20:06:07.133Z #25602: DELETE[destroyed]: ImageUploader[:avatar] User[29543] 3 files (0.07s)
All uploaded files will be stored inside the public/uploads directory. I don't want to track these files in Git, so exclude this folder:
.gitignore
public/uploads
Now create a special "uploader" class that is going to host model-specific settings. For now, this class is going to be empty:
models/image_uploader.rb
class ImageUploader < Shrine
end

Lastly, include this class inside the Photo model:
models/photo.rb
include ImageUploader[:image]
[:image] adds a virtual attribute that will be used when constructing a form. The above line can be rewritten as:
include ImageUploader.attachment(:image)
# or
include ImageUploader::Attachment.new(:image)
Nice! Now the model is equipped with Shrine's functionality, and we can proceed to the next step.
Controller, Views, and Routes
For the purposes of this demo, we'll need only one controller to manage photos. The index page will serve as the root:
photos_controller.rb
class PhotosController < ApplicationController
  def index
    @photos = Photo.all
  end
end
The view:
views/photos/index.html.erb
<h1>Photos</h1>

<%= link_to 'Add Photo', new_photo_path %>

<%= render @photos %>

In order to render the @photos array, a partial is required:
views/photos/_photo.html.erb
<div>
  <% if photo.image_data? %>
    <%= image_tag photo.image_url %>
  <% end %>
  <p><%= photo.title %> | <%= link_to 'Edit', edit_photo_path(photo) %></p>
</div>
image_data? is the attribute query method for the image_data column, and it checks whether a record has an image.
image_url is a method added by Shrine that simply returns a path to the original image. Of course, it is much better to display a small thumbnail instead, but we will take care of that later.
Add all the necessary routes:
config/routes.rb
resources :photos, only: [:new, :create, :index, :edit, :update]

root 'photos#index'
This is it—the groundwork is done, and we can proceed to the interesting part!
Uploading Files
In this section I will show you how to add the functionality to actually upload files. The controller actions are very simple:
photos_controller.rb
def new
  @photo = Photo.new
end

def create
  @photo = Photo.new(photo_params)
  if @photo.save
    flash[:success] = 'Photo added!'
    redirect_to photos_path
  else
    render 'new'
  end
end

The only gotcha is that for strong parameters you have to permit the image virtual attribute, not image_data.
photos_controller.rb
private

def photo_params
  params.require(:photo).permit(:title, :image)
end

Create the new view:
views/photos/new.html.erb
<h1>Add photo</h1>

<%= render 'form' %>
The form's partial is also trivial:
views/photos/_form.html.erb
<%= form_for @photo do |f| %>
  <%= render "shared/errors", object: @photo %>

  <%= f.label :title %>
  <%= f.text_field :title %>

  <%= f.label :image %>
  <%= f.file_field :image %>

  <%= f.submit %>
<% end %>

Once again, note that we are using the image attribute, not image_data.
Lastly, add another partial to display errors:
views/shared/_errors.html.erb
<% if object.errors.any? %>
  <h3>The following errors were found:</h3>

  <ul>
    <% object.errors.full_messages.each do |message| %>
      <li><%= message %></li>
    <% end %>
  </ul>
<% end %>
This is pretty much all—you can start uploading images right now.
Validations
Of course, much more work has to be done in order to complete the demo app. The main problem is that the users may upload absolutely any type of file with any size, which is not particularly great. Therefore, add another plugin to support validations:
config/initializers/shrine.rb
Shrine.plugin :validation_helpers
Set up the validation logic for the ImageUploader:
models/image_uploader.rb
Attacher.validate do
  validate_max_size 1.megabyte, message: "is too large (max is 1 MB)"
  validate_mime_type_inclusion ['image/jpg', 'image/jpeg', 'image/png']
end
I am permitting only JPG and PNG images of at most 1 MB to be uploaded. Tweak these rules as you see fit.
MIME Types
Another important thing to note is that, by default, Shrine will determine a file's MIME type using the Content-Type HTTP header. This header is passed by the browser and set only based on the file's extension, which is not always desirable.
If you wish to determine the MIME type based on the file's contents, then use a plugin called determine_mime_type. I will include it inside the uploader class, as other models may not require this functionality:
models/image_uploader.rb
plugin :determine_mime_type
This plugin is going to use Linux's file utility by default.
Caching Attached Images
Currently, when a user sends a form with incorrect data, the form will be displayed again with errors rendered above. The problem, however, is that the attached image will be lost, and the user will need to select it once again. This is very easy to fix using yet another plugin called cached_attachment_data:
models/image_uploader.rb
plugin :cached_attachment_data
Now simply add a hidden field into your form.
views/photos/_form.html.erb
<%= f.hidden_field :image, value: @photo.cached_image_data %>
<%= f.label :image %>
<%= f.file_field :image %>
Editing a Photo
Now images can be uploaded, but there is no way to edit them, so let's fix it right away. The corresponding controller's actions are somewhat trivial:
photos_controller.rb
def edit
  @photo = Photo.find(params[:id])
end

def update
  @photo = Photo.find(params[:id])
  if @photo.update_attributes(photo_params)
    flash[:success] = 'Photo edited!'
    redirect_to photos_path
  else
    render 'edit'
  end
end

The same _form partial will be utilized:
views/photos/edit.html.erb
<h1>Edit Photo</h1>

<%= render 'form' %>
Nice, but not enough: users still can't remove an uploaded image. In order to allow this, we'll need—guess what—another plugin:
models/image_uploader.rb
plugin :remove_attachment
It uses a virtual attribute called :remove_image, so permit it inside the controller:
photos_controller.rb
def photo_params
  params.require(:photo).permit(:title, :image, :remove_image)
end
Now just display a checkbox to remove an image if a record has an attachment in place:
views/photos/_form.html.erb
<% if @photo.image_data? %>
  Remove attachment: <%= f.check_box :remove_image %>
<% end %>
Generating a Thumbnail Image
Currently we display original images, which is not the best approach for previews: photos may be large and occupy too much space. Of course, you could simply employ the CSS width and height properties, but that's a bad idea as well. You see, even if the image is set to be small using styles, the user will still need to download the original file, which might be pretty big.
Therefore, it is much better to generate a small preview image on the server side during the initial upload. This involves two plugins and two additional gems. Firstly, drop in the gems:
gem "image_processing"
gem "mini_magick", ">= 4.3.5"
Image_processing is a special gem created by the author of Shrine. It presents some high-level helper methods to manipulate images. This gem, in turn, relies on mini_magick, a Ruby wrapper for ImageMagick. As you've guessed, you'll need ImageMagick on your system in order to run this demo.
Install these new gems:
bundle install
Now include the plugins along with their dependencies:
models/image_uploader.rb
require "image_processing/mini_magick"

class ImageUploader < Shrine
  include ImageProcessing::MiniMagick

  plugin :processing
  plugin :versions

  # other code...
end

Processing is the plugin to actually manipulate an image (for example, shrink it, rotate, convert to another format, etc.). Versions, in turn, allows us to have an image in different variants. For this demo, two versions will be stored: "original" and "thumb" (resized to 300x300).
Here is the code to process an image and store its two versions:
models/image_uploader.rb
class ImageUploader < Shrine
  process(:store) do |io, context|
    { original: io, thumb: resize_to_limit!(io.download, 300, 300) }
  end
end

resize_to_limit! is a method provided by the image_processing gem. It simply shrinks an image down to 300x300 if it is larger and does nothing if it's smaller. Moreover, it keeps the original aspect ratio.
Now when displaying the image, you just need to provide either the :original or :thumb argument to the image_url method:
views/photos/_photo.html.erb
<div>
  <% if photo.image_data? %>
    <%= image_tag photo.image_url(:thumb) %>
  <% end %>
  <p><%= photo.title %> | <%= link_to 'Edit', edit_photo_path(photo) %></p>
</div>
The same can be done inside the form:
views/photos/_form.html.erb
<% if @photo.image_data? %>
  <%= image_tag @photo.image_url(:thumb) %>
  Remove attachment: <%= f.check_box :remove_image %>
<% end %>
To automatically delete the processed files after uploading is complete, you may add a plugin called delete_raw:
models/image_uploader.rb
plugin :delete_raw
Image's Metadata
Apart from actually rendering an image, you may also fetch its metadata. Let's, for example, display the original photo's size and MIME type:
views/photos/_photo.html.erb
<div>
  <% if photo.image_data? %>
    <%= image_tag photo.image_url(:thumb) %>
    <p>
      Size <%= photo.image[:original].size %> bytes<br>
      MIME type <%= photo.image[:original].mime_type %><br>
    </p>
  <% end %>
  <p><%= photo.title %> | <%= link_to 'Edit', edit_photo_path(photo) %></p>
</div>
What about its dimensions? Unfortunately, they are not stored by default, but this is possible with a plugin called store_dimensions.
Image's Dimensions
The store_dimensions plugin relies on the fastimage gem, so hook it up now:
gem 'fastimage'
Don't forget to run:
bundle install
Now just include the plugin:
models/image_uploader.rb
plugin :store_dimensions
And display the dimensions using the width and height methods:
views/photos/_photo.html.erb
<div>
  <% if photo.image_data? %>
    <%= image_tag photo.image_url(:thumb) %>
    <p>
      Size <%= photo.image[:original].size %> bytes<br>
      MIME type <%= photo.image[:original].mime_type %><br>
      Dimensions <%= "#{photo.image[:original].width}x#{photo.image[:original].height}" %>
    </p>
  <% end %>
  <p><%= photo.title %> | <%= link_to 'Edit', edit_photo_path(photo) %></p>
</div>

Also, there is a dimensions method available that returns an array containing width and height (for example, [500, 750]).
Moving to the Cloud
Developers often choose cloud services to host uploaded files, and Shrine does present such a possibility. In this section, I will show you how to upload files to Amazon S3.
As the first step, include two more gems into the Gemfile:
gem "aws-sdk", "~> 2.1"

group :development do
  gem 'dotenv-rails'
end

aws-sdk is required to talk to S3, whereas dotenv-rails will be used to manage environment variables in development.
bundle install
Before proceeding, you should obtain a key pair to access S3 via API. To get it, sign in (or sign up) to Amazon Web Services Console and navigate to Security Credentials > Users. Create a user with permissions to manipulate files on S3. Here is the simple policy presenting full access to S3:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "*"
    }
  ]
}
Download the created user's key pair. Alternatively, you might use root access keys, but I strongly discourage you from doing that as it's very insecure.
Next, create an S3 bucket to host your files and add a file into the project's root to host your configuration:
.env
S3_KEY=YOUR_KEY
S3_SECRET=YOUR_SECRET
S3_BUCKET=YOUR_BUCKET
S3_REGION=YOUR_REGION
Never ever expose this file to the public, and make sure you exclude it from Git:
.gitignore
.env
Now modify Shrine's global configuration and introduce a new storage:
config/initializers/shrine.rb
require "shrine"
require "shrine/storage/s3"

s3_options = {
  access_key_id:     ENV['S3_KEY'],
  secret_access_key: ENV['S3_SECRET'],
  region:            ENV['S3_REGION'],
  bucket:            ENV['S3_BUCKET'],
}

Shrine.storages = {
  cache: Shrine::Storage::S3.new(prefix: "cache", **s3_options),
  store: Shrine::Storage::S3.new(prefix: "store", **s3_options),
}
That's it! No changes have to be made to the other parts of the app, and you can test this new storage right away. If you are receiving errors from S3 related to incorrect keys, make sure you accurately copied the key and secret, without any trailing spaces and invisible special symbols.
Conclusion
We've come to the end of this article. Hopefully, by now you feel much more confident in using Shrine and are eager to employ it in one of your projects. We have discussed many of this gem's features, but there are even more, like the ability to store additional context along with files and the direct upload mechanism.
Therefore, do browse Shrine's documentation and its official website, which thoroughly describes all available plugins. If you have other questions left about this gem, don't hesitate to post them. I thank you for staying with me, and I'll see you soon!
Monday, December 19, 2016
Compressing and Extracting Files in Python
If you have been using computers for some time, you have probably come across files with the .zip extension. They are special files that can hold the compressed content of many other files, folders, and subfolders. This makes them pretty useful for transferring files over the internet. Did you know that you can use Python to compress or extract files?
This tutorial will teach you how to use the zipfile module in Python, to extract or compress individual or multiple files at once.
Compressing Individual Files
This one is easy and requires very little code. We begin by importing the zipfile module and then open the ZipFile object in write mode by specifying the second parameter as 'w'. The first parameter is the path to the file itself. Here is the code that you need:
import zipfile

jungle_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\jungle.zip', 'w')
jungle_zip.write('C:\\Stories\\Fantasy\\jungle.pdf', compress_type=zipfile.ZIP_DEFLATED)

jungle_zip.close()
Please note that I will specify the path in all the code snippets in a Windows style format; you will need to make appropriate changes if you are on Linux or Mac.
You can specify different compression methods to compress files. The newer methods BZIP2 and LZMA were added in Python version 3.3, and some other zip tools don't support these two compression methods. For this reason, it is safe to just use the DEFLATED method. You should still try out these methods to see the difference in the size of the compressed file.
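One way to try out the different methods is to zip the same data with each of them and compare the resulting archive sizes. The following is a minimal sketch rather than a benchmark: the archive_sizes helper and the sample text are made up for illustration, and it works in a temporary directory so that it doesn't depend on any of the paths used in this tutorial.

```python
import os
import tempfile
import zipfile


def archive_sizes(payload):
    """Zip the same payload with each method; return archive sizes in bytes."""
    methods = {
        'STORED': zipfile.ZIP_STORED,      # no compression at all
        'DEFLATED': zipfile.ZIP_DEFLATED,
        'BZIP2': zipfile.ZIP_BZIP2,        # Python 3.3+
        'LZMA': zipfile.ZIP_LZMA,          # Python 3.3+
    }
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, 'sample.txt')
        with open(source, 'wb') as handle:
            handle.write(payload)
        for name, method in methods.items():
            target = os.path.join(tmp, name + '.zip')
            with zipfile.ZipFile(target, 'w', compression=method) as archive:
                archive.write(source, 'sample.txt')
            sizes[name] = os.path.getsize(target)
    return sizes


# Hypothetical sample data: highly repetitive text compresses very well.
sizes = archive_sizes(b'Once upon a time in the jungle. ' * 2000)
for name, size in sizes.items():
    print(name, size)
```

The exact numbers depend on your data, so it is worth running this against the kind of files you actually plan to archive.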
Compressing Multiple Files
This is slightly complex as you need to iterate over all files. The code below should compress all files with the extension pdf in a given folder:
import os import zipfile fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w') for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'): for file in files: if file.endswith('.pdf'): fantasy_zip.write(os.path.join(folder, file), os.path.relpath(os.path.join(folder,file), 'C:\\Stories\\Fantasy'), compress_type = zipfile.ZIP_DEFLATED) fantasy_zip.close()
This time, we have imported the os module and used its walk() method to go over all files and subfolders inside our original folder. I am only compressing the pdf files in the directory. You can also create different archived files for each format using if statements.
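Here is a rough sketch of that idea, with one archive for pdf files and another for txt files. The file names and the temporary folders are hypothetical and exist only to make the example self-contained. Note that the archives are written outside the folder being walked, so walk() never picks up the archives themselves.

```python
import os
import tempfile
import zipfile

# Hypothetical folder with mixed file types, created so the sketch runs anywhere.
root = tempfile.mkdtemp()
for name in ('a.pdf', 'b.pdf', 'notes.txt'):
    with open(os.path.join(root, name), 'w') as handle:
        handle.write('sample content\n')

out_dir = tempfile.mkdtemp()  # archives live outside the folder being walked
pdf_zip = zipfile.ZipFile(os.path.join(out_dir, 'pdfs.zip'), 'w')
txt_zip = zipfile.ZipFile(os.path.join(out_dir, 'texts.zip'), 'w')

for folder, subfolders, files in os.walk(root):
    for file in files:
        full_path = os.path.join(folder, file)
        arcname = os.path.relpath(full_path, root)
        # Route each file to an archive based on its extension.
        if file.endswith('.pdf'):
            pdf_zip.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
        elif file.endswith('.txt'):
            txt_zip.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)

pdf_names = sorted(pdf_zip.namelist())
txt_names = txt_zip.namelist()
pdf_zip.close()
txt_zip.close()

print(pdf_names, txt_names)
```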
If you don't want to preserve the directory structure, you can put all the files together by using the following line:
fantasy_zip.write(os.path.join(folder, file), file, compress_type = zipfile.ZIP_DEFLATED)
The write() method is called with three parameters here. The first parameter is the name of the file that we want to compress. The second parameter is optional and allows you to specify a different file name for the compressed file. If nothing is specified, the original name is used. The third parameter, compress_type, is also optional and sets the compression method for that file.
Extracting All Files
You can use the extractall() method to extract all the files and folders from a zip file into the current working directory. You can also pass a folder name to extractall() to extract all files and folders into a specific directory. If the folder that you passed does not exist, this method will create one for you. Here is the code that you can use to extract files:
import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extractall('C:\\Library\\Stories\\Fantasy')

fantasy_zip.close()
If you only want to extract some of the files, you will have to supply the names of the files that you want to extract as a list, using the members parameter of extractall().
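As a minimal sketch, the list of names goes into the members parameter of extractall(). The archive below is built in memory with made-up member names, and the target folder is a temporary one, purely so that the example is self-contained.

```python
import io
import os
import tempfile
import zipfile

# Build a small in-memory archive with hypothetical member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    for name in ('one.txt', 'two.txt', 'three.txt'):
        archive.writestr(name, 'contents of ' + name)

target_dir = tempfile.mkdtemp()
wanted = ['one.txt', 'three.txt']

with zipfile.ZipFile(buffer) as archive:
    # Every name in members must appear in archive.namelist().
    archive.extractall(target_dir, members=wanted)

extracted = sorted(os.listdir(target_dir))
print(extracted)
```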
Extracting Individual Files
This is similar to extracting multiple files. One difference is that this time you need to supply the filename first and the path to extract it to second. Also, you need to use the extract() method instead of extractall(). Here is a basic code snippet to extract individual files.
import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extract('Fantasy Jungle.pdf', 'C:\\Stories\\Fantasy')

fantasy_zip.close()
Reading Zip Files
Consider a scenario where you need to see if a zip archive contains a specific file. With what we have covered up to this point, your only option would be to extract all the files in the archive. Similarly, you may need to extract only those files which are larger than a specific size. The zipfile module allows us to inquire about the contents of an archive without ever extracting it.
Using the namelist() method of the ZipFile object will return a list of all members of an archive by name. To get information on a specific file in the archive, you can use the getinfo() method of the ZipFile object. This will give you access to information specific to that file, like the compressed and uncompressed size of the file or its last modification time. We will come back to that later.
Calling the getinfo() method one by one on all files can be a tiresome process when there are a lot of files that need to be processed. In this case, you can use the infolist() method to return a list containing a ZipInfo object for every single member in the archive. The order of these objects in the list is the same as the order of the entries in the actual zip file.
You can also directly read the contents of a specific file from the archive using the read(file) method, where file is the name of the file that you intend to read. To do this, the archive must be opened in read or append mode.
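The short sketch below puts namelist(), infolist(), and read() together. The archive and its two text members are invented for the example and are created in memory, so nothing is written to disk.

```python
import io
import zipfile

# A small in-memory archive with hypothetical members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('crow.txt', 'A thirsty crow found a pitcher.')
    archive.writestr('fox.txt', 'A fox saw some grapes.')

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    has_crow = 'crow.txt' in names   # membership check without extracting

    # One ZipInfo object per member, in archive order.
    for info in archive.infolist():
        print(info.filename, info.file_size, info.compress_size)

    # read() returns bytes, so decode them to get text.
    text = archive.read('crow.txt').decode('utf-8')

print(names, has_crow, text)
```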
To get the compressed size of an individual file from the archive, you can use the compress_size attribute. Similarly, to know the uncompressed size, you can use the file_size attribute.
The following code uses the properties and methods we just discussed to extract only those files that have a size below 1MB.
import zipfile

stories_zip = zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip')

for file in stories_zip.namelist():
    if stories_zip.getinfo(file).file_size < 1024*1024:
        stories_zip.extract(file, 'C:\\Stories\\Short\\Funny')

stories_zip.close()
To know the time and date when a specific file from the archive was last modified, you can use the date_time attribute. This will return a tuple of six values. The values will be the year, month, day of the month, hours, minutes, and seconds, in that specific order. The year will always be greater than or equal to 1980, and hours, minutes, and seconds are zero-based.
import zipfile

stories_zip = zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip')

thirsty_crow_info = stories_zip.getinfo('The Thirsty Crow.pdf')

print(thirsty_crow_info.date_time)
print(thirsty_crow_info.compress_size)
print(thirsty_crow_info.file_size)

stories_zip.close()
This information about the original file size and compressed file size can help you decide whether it is worth compressing a file. I am sure it can be used in some other situations as well.
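For instance, you could turn the two sizes into a "space saved" figure and convert the date_time tuple into a regular datetime object. This is just a sketch under simple assumptions: the member name and its text are made up, and the calculation assumes the file is not empty, since an empty file would cause a division by zero.

```python
import datetime
import io
import zipfile

# Hypothetical archive member with highly repetitive content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('story.txt', 'The crow dropped pebbles into the pitcher. ' * 500)

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo('story.txt')

# Fraction of the original size that compression saved.
saved = 1 - info.compress_size / info.file_size

# date_time is (year, month, day, hours, minutes, seconds).
modified = datetime.datetime(*info.date_time)

print(f'saved {saved:.0%}, modified {modified}')
```

A value close to zero tells you that the file barely shrinks, which is typical of formats that are already compressed.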
Final Thoughts
As evident from this tutorial, using the zipfile module to compress files gives you a lot of flexibility. You can compress different files in a directory to different archives based on their type, name, or size. You also get to decide whether you want to preserve the directory structure or not. Similarly, while extracting the files, you can extract them to the location you want, based on your own criteria like size, etc.
To be honest, it was also pretty exciting for me to compress and extract files by writing my own code. I hope you enjoyed the tutorial, and if you have any questions, please let me know in the comments.
The Power of PowerShell, Part 2
In part one, I showed you some cool stuff you can do with PowerShell, covered the history of PowerShell, and explored in depth the capabilities of PowerShell as a strong scripting language that supports procedural, functional, and object-oriented programming.
In part two, I'll discuss the interactive shell, the profile, and the prompt, and I'll compare PowerShell to Bash.
PowerShell: The Interactive Shell
PowerShell was designed from the get-go as an interactive shell for Windows sys admins and power users. It focuses on a small number of concepts, very consistent experience, and an object pipeline to chain and combine commands, filter them and format them. Its strong help system, which also adheres to a consistent format, is a pleasure to use.
Let's see some of that in action.
Getting Help
The comprehensive help system is accessible through Get-Help.
PS C:\WINDOWS\system32> Help Invoke-WebRequest

NAME
    Invoke-WebRequest

SYNOPSIS
    Gets content from a web page on the Internet.

SYNTAX
    Invoke-WebRequest [-Uri] <Uri> [-Body <Object>] [-Certificate <X509Certificate>] [-CertificateThumbprint <String>]
    [-ContentType <String>] [-Credential <PSCredential>] [-DisableKeepAlive] [-Headers <IDictionary>] [-InFile <String>]
    [-MaximumRedirection <Int32>] [-Method {Default | Get | Head | Post | Put | Delete | Trace | Options | Merge | Patch}]
    [-OutFile <String>] [-PassThru] [-Proxy <Uri>] [-ProxyCredential <PSCredential>] [-ProxyUseDefaultCredentials]
    [-SessionVariable <String>] [-TimeoutSec <Int32>] [-TransferEncoding {chunked | compress | deflate | gzip | identity}]
    [-UseBasicParsing] [-UseDefaultCredentials] [-UserAgent <String>] [-WebSession <WebRequestSession>] [<CommonParameters>]

DESCRIPTION
    The Invoke-WebRequest cmdlet sends HTTP, HTTPS, FTP, and FILE requests to a web page or web service. It parses the
    response and returns collections of forms, links, images, and other significant HTML elements.

    This cmdlet was introduced in Windows PowerShell 3.0.

RELATED LINKS
    Online Version: http://ift.tt/2gRZ1P0
    Invoke-RestMethod
    ConvertFrom-Json
    ConvertTo-Json

REMARKS
    To see the examples, type: "get-help Invoke-WebRequest -examples".
    For more information, type: "get-help Invoke-WebRequest -detailed".
    For technical information, type: "get-help Invoke-WebRequest -full".
    For online help, type: "get-help Invoke-WebRequest -online"
To get more detailed help and see examples, use the proper switches: -examples, -detailed, or -full.
If you're not sure what the command name is, just use keywords and PowerShell will show you all the available commands that contain this keyword. Let's see what cmdlets related to CSV are available:
PS C:\Users\the_g> Get-Help -Category Cmdlet csv | select name

Name
----
ConvertFrom-Csv
ConvertTo-Csv
Export-Csv
Import-Csv
I created a little pipeline where I limited the Get-Help call only to the category Cmdlet and then piped it to the "select" (alias for Select-Object) to get only the "name" property.
Working With Files and Directories
You can do pretty much everything you're used to: navigating to various directories, listing files and sub-directories, examining the content of files, creating directories and files, etc.
PS C:\Users\the_g> mkdir test_dir | select name

Name
----
test_dir

PS C:\Users\the_g> cd .\test_dir
PS C:\Users\the_g\test_dir> "123" > test.txt
PS C:\Users\the_g\test_dir> ls | select name

Name
----
test.txt

PS C:\Users\the_g\test_dir> get-content .\test.txt
123
Working With Other Providers
With PowerShell, you can treat many things as file systems and navigate them using cd and check their contents using ls/dir. Here are some additional providers:
Provider      Drive         Data store
--------      -----         ----------
Alias         Alias:        Windows PowerShell aliases
Certificate   Cert:         x509 certificates for digital signatures
Environment   Env:          Windows environment variables
Function      Function:     Windows PowerShell functions
Registry      HKLM:, HKCU:  Windows registry
Variable      Variable:     Windows PowerShell variables
WSMan         WSMan:        WS-Management configuration information
Let's check out the environment and see what Go-related environment variables are out there (on my machine):
PS C:\Users\the_g> ls env:GO*

Name                           Value
----                           -----
GOROOT                         C:\GO\
GOPATH                         C:\Users\the_g\Documents\Go
Formatting
PowerShell encourages composing cmdlets with standard switches and creating pipelines. Formatting is an explicit concept: at the end of a pipeline, you put a formatter. By default, PowerShell examines the type of the object or objects at the end of the pipe and applies a default formatter. But you can override it by specifying a formatter yourself. Formatters are just cmdlets. Here is the previous output displayed in list format:
PS C:\Users\the_g> ls env:GO* | Format-List

Name  : GOROOT
Value : C:\Go\

Name  : GOPATH
Value : c:\Users\the_g\Documents\Go
The Profile
Power users that use the command line frequently have many tasks, pipelines, and favorite combinations of commands with default switches that they favor. The PowerShell profile is a PowerShell script file that is loaded and executed whenever you start a new session. You can put all your favorite goodies there, create aliases and functions, set environment variables, and pretty much everything else.
I like to create navigation aliases to deeply nested directories, activate Python virtual environments, and create shortcuts to external commands I run frequently, like git and docker.
For me, the profile is indispensable because PowerShell's very readable and consistent commands and switches are often too verbose, and the built-in aliases are often more trouble than help (I discuss this later). Here is a very partial snippet from my profile:
#---------------------------
#
# D O C K E R
#
#---------------------------
Set-Alias -Name d -Value docker
function di { d images }

#---------------------------
#
# G I T
#
#---------------------------
Set-Alias -Name g -Value git
function gs { g status }
function gpu { g pull --rebase }

#-------------------------
#
# C O N D A
#
#-------------------------
function a { activate.ps1 $args[0] }

#------------------------
#
# N A V I G A T I O N
#
#------------------------
function cdg { cd $github_dir }

# MVP
function cdm { a ov; cdg; cd MVP }

# backend
function cdb { a ov; cdg; cd backend }

# scratch
function cds { a ov; cdg; cd scratch }

# backend packages
function cdbp { cdb; cd packages }

# Go workspace
function cdgo { cd $go_src_dir }
The Prompt
PowerShell lets you customize your command prompt. You need to define a function called prompt. You can see the built-in prompt function:
PS C:\Users\the_g> gc function:prompt

"PS $($executionContext.SessionState.Path.CurrentLocation)$('>' * ($nestedPromptLevel + 1)) ";
# .Link
# http://ift.tt/1t3sOF2
# .ExternalHelp System.Management.Automation.dll-help.xml

PS C:\Users\the_g>
Here is a custom prompt function that displays the current time in addition to the current directory:
PS C:\Users\the_g> function prompt {"$(get-date) $(get-location) > "}
10/09/2016 12:42:36 C:\Users\the_g >
You can go wild, of course, and add colors and check various conditions like if you're in a particular git repository or if you're admin.
Aliases: The Dark Side
PowerShell got aliases wrong, in my opinion, on two separate fronts. First, the alias command only allows the renaming of commands. You can't add common flags or options to make commands more useful by aliasing them to themselves.
For example, if you want to search in text line by line, you can use the Select-String cmdlet:
# Create a little text file with 3 lines
@"
ab
cd
ef
"@ > 1.txt

# Search for a line containing d
Get-Content 1.txt | Select-String d

cd
That works, but many people would like to rename Select-String to grep. But grep is by default case-sensitive, while Select-String is not. No big deal: we'll just add the -CaseSensitive flag, as in:
Set-Alias -Name grep -Value "Select-String -CaseSensitive"
Unfortunately, that doesn't work:
16:19:26 C:\Users\the_g> Get-Content 1.txt | grep D
grep : The term 'Select-String -CaseSensitive' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:21
+ Get-Content 1.txt | grep D
+                     ~~~~
    + CategoryInfo          : ObjectNotFound: (Select-String -CaseSensitive:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
The value of an alias must be either a cmdlet, a function, a script, or a program. No flags or arguments are allowed.
Now, you can do that very easily in PowerShell, but you'll have to use functions and not aliases. That pretty much constrains aliases to simple renaming, which can also be done by functions.
PowerShell vs. Bash
On the interactive shell side, PowerShell and Bash are pretty equal. Bash is more concise by default, but PowerShell's object pipeline makes complicated pipelines more manageable.
That said, you can probably accomplish anything with either one and if you're a power user then you'll have your own aliases, functions, and shortcuts for common tasks. On the scripting side, PowerShell goes far beyond Bash, and for system administration purposes it even beats Python, Ruby and friends.
An important aspect is availability. Bash comes pre-installed with most *nix distributions (unless specifically stripped) including macOS. It can also be installed on Windows via cygwin, git-bash, or msys. PowerShell comes pre-installed on Windows and just recently became available on Mac and Linux.
Conclusion
If you use Windows as a development machine or if you manage Windows machines then PowerShell is an indispensable tool. It is truly a well thought out superset of the Unix shells, and it comes pre-installed.
PowerShell is great software engineering at work. It evolved over a decade, and it kept innovating while maintaining its original conceptual integrity. The recent switch to open source and cross-platform signals that there is still a lot more to wait for.
Saturday, December 17, 2016
The Portable Guitarist—Amps, Cables and Connectors
This series has so far explained the advantages of an iOS-based rig, the gear you need, transporting it, and setting it up for live performance.
This tutorial is about amplification options plus the relevant cables and connectors.
Mono In; Stereo Out
Look at the guitar lead—there’s usually a ring below the tip, indicating a MONO, or single signal lead.
Two rings equals stereo—you’ve got the wrong lead. This is because a guitar produces a mono analogue signal. Consequently, all of the traditional gear that you’d plug a guitar into—pedals, amps and so on—come equipped with mono input sockets.
The iPad’s headphone socket, however, is a STEREO, or dual signal output. Furthermore, the output accepts a 1/8” (3.5mm) jack, whereas traditional guitar inputs accept a 1/4” (6.35mm) jack.
Simply put, the iPad produces too big a signal on too small a connector.
Thankfully, this can be overcome with the correct connections and leads. To make the right choice, however, you’ll need to decide what you’re plugging into: guitar amp, or PA (Public Address system).
Guitar Amp
Unless you own a stereo amp—such as the Roland Jazz Chorus, or run a two-amp set-up—you’ve the aforementioned stereo-into-mono problem.
iPad to the Rescue
Luckily, an iPad can output in mono as well as stereo. To do this, go to Settings, and select General.
Choose Accessibility, and scroll down to Hearing. You’ll find a slide button that activates Mono output. Try it through a mono cable into a single speaker; you’ll notice a louder, more detailed sound than when it’s in stereo mode.
Talking of cables…
Cable Conundrum
Online you’ll find any number of Y-cables that combine two mono signals into a single stereo signal.
Search ‘stereo to mono cable’, however, and you’ll get a cable consisting of one stereo connection and two mono connections. You’ll struggle to find a cable that has a single connection at either end.
Thankfully, there’s some good news.
Adaptors
You can run a mono cable from a stereo output (the iPad) to a mono input (the amp). You’ll need a stereo 1/8” jack adaptor, however, on one end of the 1/4” guitar cable for the iPad’s headphone socket.
You can also use a 1/8” jack stereo cable, but you’ll need a stereo-to-mono 1/4” adaptor on one end to plug into the amp.
These adaptors cost just a few pounds, and are found easily on Amazon and eBay.
A robust—but more expensive—solution is the iLine Mono Output Adaptor from IK Multimedia, for under £25. For around £50, it comes as part of the larger and very useful iLine Mobile Music Cable Kit.
Front End, Effects Loop
If the amp has an effects loop, you could use the iPad purely as an effects unit using cables and connectors described above.
Some words of caution:
- The amp’s overall output will be determined by the iPad, so you’ll need to turn that up
- You’ll notice a level of noise that’s higher than when plugging into the front of the amp. You may not notice it when playing, but it’ll be there when you’re not. Whether you proceed will be determined by how loud the gig is, and how much noise you or your audience can take
I’d choose to plug into the front end, but this also presents issues:
- Careful with the iPad’s volume; the louder you go, the harder you’ll drive the amp, which leads to distortion
- Like any effects pedal, its position in the signal chain affects how it performs. An expansive reverb may sound fine on a clean sound, but could get messy with distortion
PA
When I started using an iPad live, I never used a guitar amp. Why use one amp, when apps provide the sounds of many? Instead, I used a PA. It presents some new challenges, but solves a lot of problems.
Stereo Compatibility
Unlike a guitar amp, a PA accepts an array of input sources and connections. Of interest here are line input jack sockets, which are typically stereo compatible.
You could therefore connect your iPad with a 1/8” to 1/4” stereo cable. These are plentiful, and can be very cheap. Get the longest one you can, as you never know from gig to gig how far apart your equipment could be.
Double Up
Some line sockets accept stereo or mono jacks, so I run a 1/8” stereo Y-cable that terminates with two 1/4” mono jacks.
Each of these plug into separate channels, as two preamps means more output. This lets me lower the iPad’s volume, giving a cleaner input signal.
Effects
Time-based effects like delay and reverb consume the iPad’s processing power. If your PA has inbuilt effects—that you like the sound of—employ them.
Mine or Yours
I arrange my PA like a traditional guitar amp stack, placed behind me. If you’re worried about getting sound out front, most PA units have a monitor output; simply run a lead from it to your band’s PA.
However, if you don’t own a PA, you’ll have to plug into your band’s one. If so, consider these points:
- Your distance from the PA is the length of your lead
- As cables lengthen, you run into capacitance issues, causing treble loss—the longer the cable, the more muffled you sound
- If you can’t hear yourself then you need a powered monitor speaker
Conclusion
In the mono realm of guitarists, the stereo iPad can seem like a baffling choice, but you can make it work provided you:
- Understand what’s mono and what’s stereo
- Get the right cables and connectors
- Choose your amplification wisely
- A guitar amp’s front end is quieter than its effects loop
- A PA has more options
- Your own PA is easier than using your band’s
In the next tutorial, I'll explain the world of apps.
Friday, December 16, 2016
Thursday, December 15, 2016
Wednesday, December 14, 2016
Building Your First Web Scraper, Part 1
Rubyland has two gems that have occupied the web scraping spotlight for the past few years: Nokogiri and Mechanize. We'll spend an article on each of these before we put them into action with a practical example.
Topics
- Web Scraping?
- Permission
- The Problem
- Nokogiri
- Extraction?
- Pages
- API
- Node Navigation
Web Scraping?
There are fancier terms around than web or screen scraping. Web harvesting and web data extraction pretty much tell you right away what’s going on. We can automate the extraction of data from web pages, and it’s not that complicated either.
In a way, these tools allow you to imitate and automate human web browsing. You write a program that only extracts the sort of data that is of interest to you. Targeting specific data is almost as easy as using CSS selectors.
A few years ago I subscribed to some online video course that had like a million short videos but no option to download them in bulk. I had to go through every link on my own and do the dreaded ‘save as’ myself. It was sort of human web scraping—something that we often need to do when we lack the knowledge to automate that kind of stuff. The course itself was alright, but I didn’t use their services anymore after that. It was just too tedious.
Today, I wouldn’t care too much about such mind-melting UX. A scraper that would do the downloading for me would take me only a couple of minutes to throw together. No biggie!
Let me break it down real quick before we start. The whole thing can be condensed into a couple of steps. First we fetch a web page that has the desired data we need. Then we search through that page and identify the information we want to extract.
The final step is to target these bits, slice them if necessary, and decide how and where you want to store them. Well-written HTML is often key to making this process easy and enjoyable. For more involved extractions, it can be a pain if you have to deal with poorly structured markup.
What about APIs? Very good question. If you have access to a service with an API, there is often little need to write your own scraper. This approach is mostly for websites that don’t offer that sort of convenience. Without an API, this is often the only way to automate the extraction of information from websites.
You might ask, how does this scraping thing actually work? Without jumping into the deep end, the short answer is, by traversing tree data structures. Nokogiri builds these data structures from the documents you feed it and lets you target bits of interest for extraction. For example, CSS is a language written for tree traversal, for searching tree data structures, and we can make use of it for data extraction.
There are many approaches and solutions out there to play with. Rubyland has two gems that have occupied the spotlight for a number of years now. Many people still rely on Nokogiri and Mechanize for HTML scraping needs. Both have been tested and proven themselves to be easy to use while being highly capable. We will look at both of them. But before that, I’d like to take a moment to address the problem that we are going to solve at the end of this short introductory series.
Permission
Before you start scraping away, make sure you have the permission of the sites you are trying to access for data extraction. If the site has an API or RSS feed, for example, it might not only be easier to get that desired content, it might also be the legal option of choice.
Not everybody will appreciate it if you do massive scraping on their sites—understandably so. Get yourself educated on that particular site you are interested in, and don’t get yourself in trouble. Chances are low that you will inflict serious damage, but risking trouble unknowingly is not the way to go.
The Problem
I needed to build a new podcast. The design was not where I wanted it to be, and I hated the way of publishing new posts. Damn WYSIWYGs! A little bit of context. About two years ago, I built the first version of my podcast. The idea was to play with Sinatra and build something super lightweight. I ran into a couple of unexpected issues since I tailor-made pretty much everything.
Coming from Rails, it was definitely an educational journey that I appreciate, but I quickly regretted not having used a static site that I could have deployed through GitHub via GitHub pages. Deploying new episodes and maintaining them lacked the simplicity that I was looking for. For a while, I decided that I had bigger fish to fry and focused on producing new podcast material instead.
This past summer I started to get serious and worked on a Middleman site that is hosted via GitHub pages. For season two of the show, I wanted something fresh. A new, simplified design, Markdown for publishing new episodes, and no fist fights with Heroku—heaven! The thing was that I had 139 episodes lying around that needed to be imported and converted first in order to work with Middleman.
For posts, Middleman uses .markdown files that have so-called frontmatter for data, which basically replaces my database. Doing this transfer by hand is not an option for 139 episodes. That’s what computation is for. I needed to figure out a way to parse the HTML of my old website, scrape the relevant content, and transfer it to blog posts that I use for publishing new podcast episodes on Middleman.
Therefore, over the next three articles, I’m going to introduce you to the tools commonly used in Rubyland for such tasks. In the end, we’ll go over my solution to show you something practical as well.
Nokogiri
Even if you are completely new to Ruby/Rails, chances are very good that you have already heard about this little gem. The name is dropped often and sticks with you easily. I'm not sure that many know that nokogiri is Japanese for “saw”.
It's a fitting name once you understand what the tool does. The creator of this gem is the lovely Tenderlove, Aaron Patterson. Nokogiri converts XML and HTML documents into a data structure—a tree data structure, to be more precise. The tool is fast and offers a nice interface as well. Overall, it’s a very potent library that takes care of a multitude of your HTML scraping needs.
You can use Nokogiri not only for parsing HTML; XML is fair game as well. It gives you the options of both XML path language and CSS interfaces to traverse the documents you load. XML path Language, or XPath for short, is a query language.
It allows us to select nodes from XML documents. CSS selectors are most likely more familiar to beginners. As with styles you write, CSS selectors make it fantastically easy to target specific sections of pages that are of interest for extraction. You just need to let Nokogiri know what you are after when you target a particular destination.
Pages
What we always need to start with is fetching the actual page we are interested in. We specify what kind of Nokogiri document we want to parse—XML or HTML for example:
Nokogiri::XML
Nokogiri::HTML
some_scraper.rb
require "nokogiri"
require "open-uri"

page = Nokogiri::XML(File.open("some.xml"))

page = Nokogiri::HTML(File.open("some.html"))
Nokogiri::XML and Nokogiri::HTML can take IO objects or String objects. What happens above is straightforward. We open the designated file and then load its structure, its XML or HTML, into a new Nokogiri document. XML is not something beginners have to deal with very often.
Therefore, I’d recommend that we focus on HTML parsing for now. Why require open-uri? This module from the Ruby Standard Library lets us grab a site by its URL without much fuss, as we'll do in the next example. Because IO objects are fair game, we can make easy use of open-uri.
API
Let’s put this into practice with a mini example:
at_css
some_podcast_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://ift.tt/1Eqv5Ua'

# On Ruby 3.0 and later, open-uri no longer handles URLs through a plain open(url), so we call URI.open
page = Nokogiri::HTML(URI.open(url))

header = page.at_css("h2.post-title")
title = header.text

puts "This is the raw header of the latest episode: #{header}"
puts "This is the title of the latest episode: #{title}"
What we did here represents all the steps that are usually involved with web scraping—just at a micro level. We decide which URL we need and which site we need to fetch, and we load them into a new Nokogiri document. Then we open that page and target a specific section.
Here I only wanted to know the title of the latest episode. Using the at_css method and a CSS selector for h2.post-title was all I needed to target the extraction point. With this method we will only scrape this singular element, though. This gives us the whole element, markup included, which most of the time is not exactly what we need. Therefore we extract only the inner text portion of this node via the text method. For comparison, you can check the output for both the header and the text below.
Output
This is the raw header of the latest episode: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the title of the latest episode: David Heinemeier Hansson
Although this example has very limited applications, it possesses all the ingredients, all the steps that you need to understand. I think it’s cool how simple this is. Because it might not be obvious from this example, I would like to point out how powerful this tool can be. Let’s see what else we can do with a Nokogiri script.
Attention!
If you are a beginner and not sure how to target the HTML needed for this, I recommend that you search online to find out how to inspect the contents of websites in your browser. Basically, all major browsers make this process really easy these days.
On Chrome you just need to right-click on an element in the website and choose the inspect option. This will open a small window at the bottom of your browser which shows you something like an x-ray of the site’s DOM. It has many more options, and I would recommend spending some time on Google to educate yourself. This is time spent wisely!
css
The css method will give us not only a single element of choice but any element that matches the search criteria on the page. Pretty neat and straightforward!
some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://ift.tt/1Eqv5Ua'

# URI.open is required on Ruby 3.0 and later
page = Nokogiri::HTML(URI.open(url))

headers = page.css("h2.post-title")

headers.each do |header|
  puts "This is the raw title of the latest episode: #{header}"
end

headers.each do |header|
  puts "This is the title of the latest episode: #{header.text}"
end
Output
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/136/">James Edward Gray II</a></h2>

This is the title of the latest episode: David Heinemeier Hansson
This is the title of the latest episode: Zach Holman
This is the title of the latest episode: Joel Glovier
This is the title of the latest episode: João Ferreira
This is the title of the latest episode: Corwin Harrell
This is the title of the latest episode: Roberto Machado
This is the title of the latest episode: James Edward Gray II
The only little difference in this example is that I iterate over the raw headers first and then extract each header's inner text with the text method. Nokogiri stops at the end of the page and does not attempt to follow the pagination on its own.
Let’s say we want to have a bit more information, say the date and the subtitle for each episode. We can simply expand on the example above. It is a good idea anyway to take this step by step. Get a little piece working and add in more complexity along the way.
some_scraper.rb
require 'nokogiri'
require 'open-uri'

url = 'http://ift.tt/1Eqv5Ua'
page = Nokogiri::HTML(URI.open(url))

# Each article element wraps one episode
articles = page.css("article.index-article")

articles.each do |article|
  # at_css returns only the first match inside this article
  header = article.at_css("h2.post-title")
  date = article.at_css(".post-date")
  subtitle = article.at_css(".topic-list")

  puts "This is the raw header: #{header}"
  puts "This is the raw date: #{date}"
  puts "This is the raw subtitle: #{subtitle}\n\n"

  puts "This is the text header: #{header.text}"
  puts "This is the text date: #{date.text}"
  puts "This is the text subtitle: #{subtitle.text}\n\n"
end
Output
This is the raw header: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw date: <span class="post-date">Oct 18 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>

This is the text header: David Heinemeier Hansson
This is the text date: Oct 18 | 2016
This is the text subtitle: Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk

This is the raw header: <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw date: <span class="post-date">Oct 12 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>

This is the text header: Zach Holman
This is the text date: Oct 12 | 2016
This is the text subtitle: Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication

This is the raw header: <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw date: <span class="post-date">Oct 10 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source</h3>

This is the text header: Joel Glovier
This is the text date: Oct 10 | 2016
This is the text subtitle: Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source

This is the raw header: <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw date: <span class="post-date">Aug 26 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>

This is the text header: João Ferreira
This is the text date: Aug 26 | 2015
This is the text subtitle: Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values

This is the raw header: <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw date: <span class="post-date">Aug 06 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>

This is the text header: Corwin Harrell
This is the text date: Aug 06 | 2015
This is the text subtitle: Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |

This is the raw header: <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw date: <span class="post-date">Aug 03 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs</h3>

This is the text header: Roberto Machado
This is the text date: Aug 03 | 2015
This is the text subtitle: CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs

This is the raw header: <h2 class="post-title"><a href="episodes/136/">James Edward Gray II</a></h2>
This is the raw date: <span class="post-date">Jul 30 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Screencasting | Less Code | Reading code | Getting unstuck | Rails’s codebase | CodeNewbie | Small examples | Future plans | PeepCode | Frequency & pricing</h3>

This is the text header: James Edward Gray II
This is the text date: Jul 30 | 2015
This is the text subtitle: Screencasting | Less Code | Reading code | Getting unstuck | Rails’s codebase | CodeNewbie | Small examples | Future plans | PeepCode | Frequency & pricing
At this point, we already have some data to play with. We can structure or butcher it any way we like. The above should simply show what we have in a readable fashion. Of course, we can dig deeper into each of these by running regular expressions against the result of the text method.
We will look into this in a lot more detail when we get to solving the actual podcast problem. It won’t be a class on regexp, but you will see some more of it in action, though not so much as to make your brain bleed.
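As a small preview of what that can look like, here is a hedged sketch that takes strings of the kind the text method returned above and turns them into structured data. The sample values are copied from the output (the subtitle is shortened), while the regular expression, the variable names, and the shape of the resulting hash are my own assumptions rather than the approach used later in the series.

```ruby
require 'date'

# Sample strings standing in for header.text, date.text and subtitle.text.
header_text   = "David Heinemeier Hansson"
date_text     = "Oct 18 | 2016"
subtitle_text = "Rails community | Tone | Technical disagreements"

# Capture month, day and year from a string shaped like "Oct 18 | 2016".
date_pattern = /\A(?<month>[A-Z][a-z]{2}) (?<day>\d{1,2}) \| (?<year>\d{4})\z/
match = date_text.match(date_pattern)

# Rebuild the pieces into a real Date object.
published_on = Date.strptime("#{match[:month]} #{match[:day]} #{match[:year]}", "%b %d %Y")

# Split the pipe-separated subtitle into individual topics,
# dropping the empty entry a trailing pipe would leave behind.
topics = subtitle_text.split("|").map(&:strip).reject(&:empty?)

episode = { title: header_text, published_on: published_on, topics: topics }
puts episode.inspect
```

Once the date is a Date and the topics are an array, sorting, filtering, and exporting the episodes becomes much easier than working with the raw strings.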
Attributes
What could be handy at this stage is extracting the href for the individual episodes as well. It couldn’t be any simpler.
some_scraper.rb
require 'nokogiri'
require 'open-uri'

url = 'http://ift.tt/1Eqv5Ua'
page = Nokogiri::HTML(URI.open(url))

# Root URL that the relative episode links get appended to
podcast_url = "http://ift.tt/1Eqv5Ua"

articles = page.css("article.index-article")

articles.each do |article|
  header = article.at_css("h2.post-title")
  date = article.at_css(".post-date")
  subtitle = article.at_css(".topic-list")
  # The anchor inside the title carries the episode's href
  link = article.at_css("h2.post-title a")

  puts "This is the raw header: #{header}"
  puts "This is the raw date: #{date}"
  puts "This is the raw subtitle: #{subtitle}"
  puts "This is the raw link: #{link}\n\n"

  puts "This is the text header: #{header.text}"
  puts "This is the text date: #{date.text}"
  puts "This is the text subtitle: #{subtitle.text}"
  # link[:href] reads the href attribute of the node
  puts "This is the href: #{podcast_url}#{link[:href]}\n\n"
end
The most important bits to pay attention to here are [:href] and podcast_url. If you tag [:attribute_name] onto a node, you can simply extract that attribute from it. I abstracted a little further, but you can see more clearly how it works below.
...
href = article.at_css("h2.post-title a")[:href]
...
To get a complete and useful URL, I saved the root domain in a variable and constructed the full URL for each episode.
...
podcast_url = "http://ift.tt/1Eqv5Ua"
puts "This is the href: #{podcast_url}#{link[:href]}\n\n"
...
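String interpolation works as long as the root URL ends with a slash and the href is relative. As an alternative sketch, Ruby's standard URI.join resolves a relative link against a base URL and takes care of the slash for you. The example.com addresses below are placeholders, not the podcast's real URL.

```ruby
require 'uri'

# Hypothetical root URL and a relative href like the ones scraped above.
base_url      = "http://example.com/"
relative_href = "episodes/143/"

# URI.join resolves the relative reference against the base URL.
full_url = URI.join(base_url, relative_href).to_s
puts full_url

# It also copes with a base that lacks the trailing slash,
# where plain concatenation would yield "http://example.comepisodes/143/".
no_slash_url = URI.join("http://example.com", relative_href).to_s
puts no_slash_url
```

For a handful of links on a single page, concatenation is perfectly fine; URI.join starts to pay off once hrefs come in mixed forms (relative, root-relative, or already absolute).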
Let’s take a quick look at the output:
Output
This is the raw header: <h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
This is the raw date: <span class="post-date">Oct 25 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
This is the raw link: <a href="episodes/143/">Jason Long</a>

This is the text header: Jason Long
This is the text date: Oct 25 | 2016
This is the text subtitle: Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas
This is the href: http://ift.tt/2gA5f61

This is the raw header: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw date: <span class="post-date">Oct 18 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>
This is the raw link: <a href="episodes/142/">David Heinemeier Hansson</a>

This is the text header: David Heinemeier Hansson
This is the text date: Oct 18 | 2016
This is the text subtitle: Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk
This is the href: http://ift.tt/2hEll3v

This is the raw header: <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw date: <span class="post-date">Oct 12 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>
This is the raw link: <a href="episodes/141/">Zach Holman</a>

This is the text header: Zach Holman
This is the text date: Oct 12 | 2016
This is the text subtitle: Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication
This is the href: http://ift.tt/2dZ8mqu

This is the raw header: <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw date: <span class="post-date">Oct 10 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source</h3>
This is the raw link: <a href="episodes/140/">Joel Glovier</a>

This is the text header: Joel Glovier
This is the text date: Oct 10 | 2016
This is the text subtitle: Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source
This is the href: http://ift.tt/2hEsTmM

This is the raw header: <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw date: <span class="post-date">Aug 26 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>
This is the raw link: <a href="episodes/139/">João Ferreira</a>

This is the text header: João Ferreira
This is the text date: Aug 26 | 2015
This is the text subtitle: Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values
This is the href: http://ift.tt/2gA8Lgr

This is the raw header: <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw date: <span class="post-date">Aug 06 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>
This is the raw link: <a href="episodes/138/">Corwin Harrell</a>

This is the text header: Corwin Harrell
This is the text date: Aug 06 | 2015
This is the text subtitle: Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |
This is the href: http://ift.tt/2hEnDQ1

This is the raw header: <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw date: <span class="post-date">Aug 03 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs</h3>
This is the raw link: <a href="episodes/137/">Roberto Machado</a>

This is the text header: Roberto Machado
This is the text date: Aug 03 | 2015
This is the text subtitle: CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs
This is the href: http://ift.tt/2gA6bHg
Neat, isn’t it? You can do the same to extract the [:class] of a node.
require 'nokogiri'
require 'open-uri'

url = 'http://ift.tt/1Eqv5Ua'
page = Nokogiri::HTML(URI.open(url))

# Read the class attribute of the body element
body_classes = page.at_css("body")[:class]
If that node has more than one class, you will get all of them back in a single space-separated string, which you can break into an array with split.
Node Navigation
- parent
- children
- previous_sibling
- next_sibling
We are used to dealing with tree structures in CSS or even jQuery. It would be a pain if Nokogiri didn't offer a handy API to move within such trees.
some_scraper.rb
require 'nokogiri'
require 'open-uri'

url = 'http://ift.tt/1Eqv5Ua'
page = Nokogiri::HTML(URI.open(url))

# Start from the first episode title and move around the tree from there
header = page.at_css("h2.post-title")
header_children = page.at_css("h2.post-title").children
header_parent = page.at_css("h2.post-title").parent
header_prev_sibling = page.at_css("h2.post-title").previous_sibling

puts "#{header}\n\n"
puts "#{header_children}\n\n"
puts "#{header_parent}\n\n"
puts "#{header_prev_sibling}\n\n"
Output
#header
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>

#header_children
<a href="episodes/143/">Jason Long</a>

#header_parent
<article class="index-article">
<span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
<div class="soundcloud-player-small"> </div>
</article>

#header_previous_sibling
<span class="post-date">Oct 25 | 2016</span>
As you can see for yourself, this is some pretty powerful stuff, especially when you see what .parent was able to collect in one go. Instead of defining a bunch of nodes by hand, you could collect them wholesale.
You can even chain them for more involved traversals. You can take this as complicated as you like, of course, but I would caution you to keep things simple. It can quickly get a little unwieldy and hard to understand. Remember, "Keep it simple, stupid!"
...
header_parent_parent = page.at_css("h2.post-title").parent.parent
header_prev_sibling_parent_children = page.at_css("h2.post-title").previous_sibling.parent.children
...
some_scraper.rb
require 'nokogiri'
require 'open-uri'

url = 'http://ift.tt/1Eqv5Ua'
page = Nokogiri::HTML(URI.open(url))

# Chain navigation calls to walk several steps through the tree
header = page.at_css("h2.post-title")
header_prev_sibling_children = page.at_css("h2.post-title").previous_sibling.children
header_parent_parent = page.at_css("h2.post-title").parent.parent
header_prev_sibling_parent = page.at_css("h2.post-title").previous_sibling.parent
header_prev_sibling_parent_children = page.at_css("h2.post-title").previous_sibling.parent.children

puts "#{header}\n\n"
puts "#{header_prev_sibling_children}\n\n"
puts "#{header_parent_parent}\n\n"
puts "#{header_prev_sibling_parent}\n\n"
puts "#{header_prev_sibling_parent_children}\n\n"
Output
#header
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>

#header_previous_sibling_children
Oct 25 | 2016

#header_parent_parent
<li>
<article class="index-article">
<span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
<div class="soundcloud-player-small"> </div>
</article>
</li>

#header_previous_sibling_parent
<article class="index-article">
<span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
<div class="soundcloud-player-small"> </div>
</article>

#header_previous_sibling_parent_children
<span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
<div class="soundcloud-player-small"> </div>
Final Thoughts
Nokogiri is not a huge library, but it has a lot to offer. I recommend you play with what you have learned thus far and expand your knowledge through its documentation when you hit a wall. But don’t get yourself into trouble!
This little intro should get you well on your way to understanding what you can do and how it works. I hope you will explore it a bit more on your own and have some fun with it. As you will find out on your own, it’s a rich tool that keeps on giving.