Find PDFs that are not OCRed (Windows grep)

This works reasonably well, although some files may have font properties without the word “Font” (the search is case-sensitive).
Open a CMD prompt
cd Desktop (or whatever target directory)
grep -L Font *.pdf > list_of_files.txt
-L = only return the names of files that do not contain a match
-r = recursive (searches subdirectories; not used in the command above)
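
For anyone without grep on Windows, the same check can be sketched in Python. This is only a rough equivalent of the command above, not a real OCR test: it looks for the raw bytes "Font" in each file, so the same caveat applies. The function name and its parameters are my own invention for illustration.

```python
from pathlib import Path


def pdfs_without_fonts(folder, recursive=False):
    """Return PDFs under `folder` whose raw bytes never contain b"Font".

    Mirrors `grep -L Font *.pdf`: the match is case-sensitive and purely
    textual, so it is a heuristic, not proof that a file lacks a text layer.
    """
    # "**/*.pdf" also matches files directly inside `folder`.
    pattern = "**/*.pdf" if recursive else "*.pdf"
    missing = []
    for pdf in sorted(Path(folder).glob(pattern)):
        if b"Font" not in pdf.read_bytes():
            missing.append(pdf)
    return missing
```

Calling `pdfs_without_fonts("Desktop", recursive=True)` returns the list that the grep command would write to list_of_files.txt, plus anything in subfolders.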

Quickly access journal PDFs for which UND has a subscription

If you're using Google or another engine to search online for journal articles, seven times out of ten you'll end up at the site where you can get the PDF (via institutional subscription), but the site won't recognize you as coming from the institution. The same thing happens whenever you are off-campus. One way to get access is to head back to the UND Libraries site, find the eJournal search, type the journal name, and then navigate to the right issue. There is a better way.

Take this URL for example.  I wound up here after Googling for one of the authors and looking for his email address:

In order to get that PDF (or see if we even have access), just add the UND proxy to the middle of the URL (bolded for convenience):

You'll get bumped to the UND Libraries off-campus login page, and then back to the article page when you've logged in.  Now if you click on the PDF link, you'll find out whether we have subscription access (the PDF will open) or not (you'll hit a paywall).  This has worked well for me for the past year or so.
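
If you do this often, the URL rewrite is easy to script. The sketch below assumes the proxy works as a suffix added to the publisher's host name, which is what "the middle of the URL" amounts to. The suffix `proxy.example.edu` is a placeholder, not the real UND proxy address, and `add_proxy` is a name I made up.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder only: substitute your own library's proxy suffix here.
PROXY_SUFFIX = "proxy.example.edu"


def add_proxy(url, proxy_suffix=PROXY_SUFFIX):
    """Insert the proxy suffix after the host name and before the path."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        raise ValueError(f"no host name found in URL: {url!r}")
    netloc = f"{host}.{proxy_suffix}"
    # Keep an explicit port if the original URL had one.
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The path, query string, and fragment pass through untouched, so after logging in you should land on the same article page.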

Searching the PDF Library

[EDIT: The PDF search described below no longer exists, but the mention of a preprint server for other sciences is becoming a reality with PeerJ Preprints.]

For those who are looking for paleontology or geology papers in PDF format, you might be able to find them with the full-text search I’ve installed here. There are 40 GB or so of files to access. If you find a file there that you can’t access any other way, drop me an email and I can send it.

This is the easiest way I can think of to share my PDF library at the moment. In the past I’ve experimented with Alliance, OneSwarm, and even torrenting, but the first two applications require a critical mass of users to be viable (something I’ve never been able to get) and the last is difficult to update.

While a preprint server such as (but for other sciences [than physics, 2014-02-04]) would be useful for the future, it wouldn’t help to distribute the vast knowledge contained in works that are out of print. For this purpose we, as scientists, need to form our own distribution network. I will keep this directory up for myself and those who need it, but for complete sharing of published works I still think we need a P2P network devoted to that purpose.

[publication] A new occurrence of /Protichnites/ Owen, 1852, in the Late Cambrian Potsdam Sandstone of the St. Lawrence Lowlands

BURTON-KELLY, M.E. and J.M. ERICKSON. 2010. A new occurrence of Protichnites Owen, 1852, in the Late Cambrian Potsdam Sandstone of the St. Lawrence Lowlands. The Open Paleontology Journal 3:1-13.

You can download a PDF from here (1 MB). You can follow this publication on or ResearchGate.

Buying PDFs: Commentary

This post was originally a comment on Andy’s post “Buying PDFs: Truth and Consequences” at The Open Paleontologist blog. The text grew too long, so I’m devoting a full post to it, even though it’s a bit rough. The topic is how much we pay for PDFs of published articles, and why this is so disproportionate to physical copies.

People who know me already know what my suggested “solution” is: share as many PDFs with as many people as possible, in order to help the publishers reevaluate their prices. However, legality prevents me from supporting such action. The idea is modeled after the philosophy of Downhill Battle: in order to get radio stations to play music beyond the mainstream (paid for by the record companies), we need to bankrupt the record companies, essentially by no longer buying music, or at least music produced by the largest companies, which pay the biggest bucks toward keeping their music on the air.

I’m not sure whether Andy has a citation for his observation that publishers like Elsevier continue “to post profits in the midst of the recession.” It would be interesting to have someone play with those numbers a bit.

This ends up being like gas prices. I get that as a business you get to set your prices at whatever the market will bear, but the strategy of moving more merchandise rather than more expensive merchandise should always be considered. How much research do these publishers do into individual sub-fields? As you say, hospitals can pay top dollar for a single article, but more paleontologists will buy an article if it’s cheaper (especially if they are unaffiliated), will be able to do the research they want, and will be looking for a place to publish.

On that note, I hope people continue to vote with their feet when it comes to open-access vs. closed-access, or even when some journals simply have slightly lower per-PDF fees. I’ve had the discussion recently about what “high impact” means anymore: nothing. It used to mean that the physical journal was available in more libraries and hence better-read and better-cited, but since everything goes to PDF now, everything (new) is equally available to someone who can do a halfway decent job of searching. This gives us all the freedom to publish in journals with whose practices we agree, rather than in those with the widest physical distribution.


One of my interests is building a PDF library for myself and fellow graduate and undergraduate students, which means that it’s very hard to pass up PDFs when I come across them on the web. So right now I’m downloading anything I even look at during my research, to keep and to pass along.

This may well fill up my hard drive this semester.

Alliance p2p notes

I’m working my way through this program. I think it works really well; these are just my notes. I haven’t gotten around to posting them in the official forum or bug tracker yet.

Invite Codes
*The only thing the LAN checkbox affects is new invite codes. It does not change how your computer accesses anything else.

*Leaving LAN unchecked uses the web-accessible IP address of the computer. Checking LAN uses the local address (192.168.XXX.XXX).

*You can create two invite codes for each computer: a LAN code and a web code. Sometimes one will work where the other one will not, depending on the relationship between the two computers.
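
To tell which kind of address a given machine would put in its invite code, you can check whether the address falls in a private range. This is only a sketch using Python's standard `ipaddress` module. It says nothing about how Alliance itself decides, and the function name is my own.

```python
import ipaddress


def invite_code_kind(address):
    """Classify an IP address as 'LAN' (private range) or 'web' (public)."""
    ip = ipaddress.ip_address(address)
    # is_private covers 192.168.x.x, 10.x.x.x, 172.16-31.x.x, and also
    # loopback and link-local addresses, which are not reachable from the web.
    return "LAN" if ip.is_private else "web"
```

An address like 192.168.1.20 comes back as "LAN", while a public address such as 8.8.8.8 comes back as "web".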

Moving Around
*Switching one computer from one network to another (e.g., from a wireless access point to a wall jack) sometimes results in loss of connection between the machines that cannot be repaired without a) changing the hostname of the buddy to either a local or web-accessible IP or b) deleting a buddy and re-adding them with their invite code.

*It may be the case that Reconnect has to be pressed by users on both ends of a connection after one person has switched networks in order for the change in IP address to register and the connection to be reestablished.

*It is unclear what “Reconnect” actually does.

*It is unclear how long it should take for certain things to happen. Sometimes it takes a couple minutes for everything to get situated after starting Alliance or switching networks. It also lags a couple minutes when someone leaves the network before removing them from the active users list.

*Should you be able to add friends of friends with whom you cannot connect? There is no dialogue or notification when they are unavailable after clicking “Finish.” I know that not everyone’s friends of friends will be online all the time, but if they are, and are not accessible from the current location, that should be reflected somewhere for the user.

*What port is being used to send invite codes? Is it possible that the router is getting mixed up and using the wrong port? Is this encoded into the invite code?

*When downloading, then disconnecting, then attempting to download from a different buddy, download will not begin. After removing everything in the download queue, restart is still required before downloading will commence. This may be related to switching networks too.

*Is there a limit on the number of files that can be shown per directory? Alliance will not display a folder with 4840 files (it kicks the connection for a few minutes), but a search will display files in that folder that match the search criteria. This might also have been related to funky characters in a filename within the folder (since those files weren’t part of the search results, the folder displayed just fine). It turned out to be the number of files: keep it under about 1,000.

*Sometimes the checkbox to browse a folder just disappears. Why is this? Because the shared directory name has been changed and the folder doesn’t exist anymore.

*When you have two laptops on the same wireless network and very few people using Alliance in general, it acts like there needs to be a critical mass in order for sharing to really occur. For example, A and B are in the same room, on the same network, and can see each other. B can see a third computer C, on a different part of the network (outside the wireless access point), but A cannot. In this case, Friends of Friends seems to work, but user C never shows up for A.
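
Regarding the note above about folders with too many files: a quick way to find the offenders in a shared directory before Alliance chokes on them is to count files per folder. This is a minimal sketch; the limit of 1,000 is just my rough observation from above, not a documented Alliance setting, and the function name is made up.

```python
import os


def crowded_folders(root, limit=1000):
    """Return (folder, file_count) pairs for folders holding more than `limit` files."""
    crowded = []
    for folder, _subdirs, files in os.walk(root):
        # Only files directly inside this folder count, not those in subfolders.
        if len(files) > limit:
            crowded.append((folder, len(files)))
    return crowded
```

Anything this returns is a candidate for splitting into subfolders.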

Open-Access Journals

The current annoyance on the VRTPALEO list is the academic publishing industry, which will publish your work in exchange for owning the copyright (meaning that you, as an author, cannot distribute your own work without permission). A simplified but good analogy is made by Scott Aaronson here:

I have an ingenious idea for a company. My company will be in the business of selling computer games. But, unlike other computer game companies, mine will never have to hire a single programmer, game designer, or graphic artist. Instead I’ll simply find people who know how to make games, and ask them to donate their games to me. Naturally, anyone generous enough to donate a game will immediately relinquish all further rights to it. From then on, I alone will be the copyright-holder, distributor, and collector of royalties. This is not to say, however, that I’ll provide no “value-added.” My company will be the one that packages the games in 25-cent cardboard boxes, then resells the boxes for up to $300 apiece.

But why would developers donate their games to me? Because they’ll need my seal of approval. I’ll convince developers that, if a game isn’t distributed by my company, then the game doesn’t “count” — indeed, barely even exists — and all their labor on it has been in vain.

Admittedly, for the scheme to work, my seal of approval will have to mean something. So before putting it on a game, I’ll first send the game out to a team of experts who will test it, debug it, and recommend changes. But will I pay the experts for that service? Not at all: as the final cherry atop my chutzpah sundae, I’ll tell the experts that it’s their professional duty to evaluate, test, and debug my games for free!

We need to figure out a way to exchange information without making people pay exorbitant fees for it, but in the current situation we could be sued for distributing our own work in PDF format. I’m no opponent of paper copies of journals, but if all you want is a PDF of a work that is peer-reviewed, there’s no reason you should have to pay for it.

EDIT: This person has something to say about it too, with an analogy to the QWERTY keyboard.