Intro

Lately I've been working on a project for the UConn admissions office. Part of the project was to create a WordPress plugin that would take a CSV file and read the rows into the database on a schedule. Which is easy enough right? The problem was that there were over 16,000 rows to insert....

So.

Obviously (even though I knew it wouldn't work) I tried just inserting them. But that took about 30 seconds to blow through the memory constraint on the server. So I want to share a few solutions I worked through first, and then explain why I landed on a generator function.

In this post you'll learn:

  • Some solutions I considered and rejected
  • What a PHP generator function is
  • When and why to use a generator
  • Potential downsides to generators

Possible (Rejected) Solutions

Batching

One choice was to set an arbitrary number of rows (say 500). Then inside a while loop keep track of that. Then every time that limit is reached, restart some process. That would work, but seems inelegant.

File manipulation

Another choice was to read the file and then split it and write new files to the file system. Each one of those could have an arbitrary number of rows. Then I could read each file separately. But this didn't really feel right to me. What would happen if all of a sudden I lost write access to the directory the files needed to be in?

Batch insert

SQL can batch insert rows into a database and there are a few ways to do it. But it would mean writing a custom SQL query instead of relying on the WordPress core database methods. That's kind of the only down side. But, while I can write basic SQL queries, I was nervous about writing a complex query for one of the most important sies a university has....

Using A Generator

All that research led me to remember about generators. For this kind of operation, they're perfect.

A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.

PHP Generator Documentation

Perfect! A function that creates an array-like result but doesn't hit memory limits? Sign me up!

So how does this work? Let me show you some code. For a lot of projects, I've been using a (more) object oriented approach. So I'll show this in terms of a class.

Basically this class needs to be able to do three things:

  • set itself up (including access to the WP database)
  • read the file
  • write the data

Set Up A Class

I wrote my own class to wrap the global $wpdb variable. For now, I'm not going to get into that. But it made sense for this project and maybe in a later post I'll explain in more detail.

<?php

namespace MyPlugin;

use MyPlugin\Database;

class FileManager {
  protected $db;

  public function __construct() {
    // set up the class
    // prepare database access
    $this->db = new Database();
  }
}

Prepare A Method to Yield

Next, you need a method that will yield instead of return a value. Yeah... That's all you need to do to create a generator. So let's set that up and I'll explain as I go.

<?php

namespace MyPlugin;

// Generator is the core PHP class needed to use the yield keyword
use Generator
use MyPlugin\Database;

class FileManager {
  protected $db;

  public function __construct() {
    // set up the class
    // prepare database access
    $this->db = new Database();
  }

  public function yieldData() {
    // read the file and yield the data
  }
}

Now, I'll zoom in on the yieldData method. This method needs to

  • read the file as a stream
  • if the stream is valid...
  • yield the value for each row while there are rows to read. Note that I've typed the "return" value as Generator. That will keep me (or anyone else) from accidentally trying to use return. I hope.
  • finally stop reading the file
<?php

public function yieldData(): Generator {
  // filePath is defined in the constructor.
  // the 'r' mode if for read only.
  $stream = fopen($this->filePath, 'r');

  // let's try this out
  try {
    // if the stream is good
    if ($stream !== false) {
      // get CSV data from the stream.
      // assume each line is 1000 characters long
      // break on tabs
      // when we run out of data, stop.
      while (($data = fgetcsv($stream, 1000, "\t")) !== false) {
        // yield instead of return
        yield $data;  
      }
    }
  } finally {
    // close out the stream when we're done
    fclose($stream);
  }
}

Now, we can call this method with $this->yeildData() and get all the data without building an array with all the memory constraints that go with that.

Write to the Database

Let's go back to the class and prepare a method that will write to the database. Then I'll zoom into that and walk through it.

<?php

namespace MyPlugin;

// Generator is the core PHP class needed to use the yield keyword
use Generator
use MyPlugin\Database;

class FileManager {
  protected $db;

  public function __construct() {
    // set up the class
    // prepare database access
    $this->db = new Database();
  }

  public function yieldData() {
    // read the file and yield the data
  }

  public function writeToDatabase() {
    // write to the database
  }
}

This method needs to do a few things.

  • Set up a counter to return. Since I'm using this method in an API call, I want to make sure there's some kind of sensible response. Counting the number of rows is a good way to go.
  • Get the $rows from the yieldData method. This will prepare an iterator.
  • Insert the row and get the result. The WordPress insert method returns either an integer on success or false on failure. This will...
  • In crease the count and return it.
<?php
// zooming in...
public function writeToDatabase() {
  // prepare the counter
  $count = 0;
  // yield the data
  $rows = $this->yieldData();
  // loop!
  foreach ($rows as $row) {
    // add successful inserts to the count
    $count += $this->db->insertRow('my_table', [
      'school' => $row[0],
      'schoolid' => "0",
      'ext_subject' => $row[1],
      'ext_number' => $row[2],
      'int_subject' => $row[3],
      'int_number' => $row[4],
      'int_title' => substr($row[5], 0, 75),
      'seq_number' => $row[6]
    ]);
  }
  // return the count
  return $count;
}

Putting it Together

So now, the class looks like this.

<?php

namespace MyPlugin;

// Generator is the core PHP class needed to use the yield keyword
use Generator
use MyPlugin\Database;

class FileManager {
  protected $db;

  public function __construct() {
    // set up the class
    // prepare database access
    $this->db = new Database();
  }

  public function yieldData(): Generator {
    // filePath is defined in the constructor.
    // the 'r' mode if for read only.
    $stream = fopen($this->filePath, 'r');

    // let's try this out
    try {
      // if the stream is good
      if ($stream !== false) {
        // get CSV data from the stream.
        // assume each line is 1000 characters long
        // break on tabs
        // when we run out of data, stop.
        while (($data = fgetcsv($stream, 1000, "\t")) !== false) {
          // yield instead of return
          yield $data;  
        }
      }
    } finally {
      // close out the stream when we're done
      fclose($stream);
    }
  }

  public function writeToDatabase(): int {
    $count = 0;
    $rows = $this->yieldData();
    foreach ($rows as $row) {
      $count += $this->db->insertRow('my_table', [
        'school' => $row[0],
        'schoolid' => "0",
        'ext_subject' => $row[1],
        'ext_number' => $row[2],
        'int_subject' => $row[3],
        'int_number' => $row[4],
        'int_title' => substr($row[5], 0, 75),
        'seq_number' => $row[6]
      ]);
    }
    return $count;
  }
}

Conclusion

Now, whenever I need to write the file contents to the database, I can just do this -

<?php

use MyPlugin\FileManager;

$fileManager = new FileManager();
$fileManager->writeToDatabase();

This all feels so much nicer and more stable to me than any of the other methods. Are there downsides to using generators? Well sure. They can be slower than arrays. In this case it doesn't matter since this isn't a public facing operation. If you really need to use the yielded result as an array, you have to use the iterator_to_array PHP function. There's also an extra step if you need keys as well as values. But on the whole? For something like this, I think it's pretty perfect.

I hope you enjoyed this! Let me know what you think 🙂