OSINT in the newsroom

george · 2017-02-17 22:26:11 UTC

Any journos out there worked in newsrooms where getting organized for collecting OSINT was made a priority for reporters/engineers?

Reporters have used Accurint and equivalents for quite some time, but what about other ways of developing source material through active and organized culling of social networks and other available material? Something like ICWatch comes to mind.

If so what software have you used? If not, what tools would folks recommend?

UPDATE:
What's the best open source tool for doing social network analysis?

rorybyrne · 2017-02-18 18:57:37 UTC

I know that my old place, Videre, used Sentinel Visualizer, amongst other tools. It's fairly pricey though.

http://www.fmsasg.com/Solutions/investigations/human-rights-videos.html

george · 2017-02-18 20:06:23 UTC

It's not a subscription at least. Looks pretty robust. I think I'd be most interested in the SNA. Reframing question ...

UPDATE: Wonder how Maltego's community edition stacks up? I remember downloading it a while ago but never gave it a shot. Long weekend project, here I come.

rorybyrne · 2017-02-19 18:01:16 UTC

Maltego is pretty good for certain types of OSINT, especially technical. I do find it a little tricky to do some of the social network analysis things in the same way that other tools do. Still, I have high hopes for Maltego CE as a useful starting point for some orgs.

jonathanstray · 2017-02-21 23:09:14 UTC

We've been experimenting with Neo4j here at Columbia. This and other graph systems are starting to be used in journalism (1,2) and I am actively pursuing it with my students because I believe in the huge potential. But tbh not that much has yet come of this type of work, here or elsewhere.

The main problems, as I see it:

Getting enough data / the right data in is a HUGE amount of work. The world is more than Twitter. In our case (the Global Shipping Project) the useful resources are mostly subscription-only databases like SeaWeb. These need to be searched and/or scraped before they can be loaded. You also end up working with lots of crazy dirty documents, which need to have entities extracted -- a tremendously difficult process. You think this is just "oh I'll run Stanford NER" but it's really, really not. My very own Overview has a useful partial solution to this problem, and it's got an API too if you felt like working on it further.
The UI really matters and I haven't yet found a program which works well. The issue is you almost never want to see all connections from a node, which is what most graph DB front ends do. Instead you want to build up the graph piece by piece as you explore the network, retaining only those nodes and edges which seem to be "a story." Conversely, there are lots of graph tools that let you build a graph up like this, but essentially none of them connect to a DB (Linkurious may be the notable counterexample.)
Several people have proposed finding stories by doing graph queries (I wrote about OCCRP's "Crime Pattern Recognition" project) but I do not know of a single story that was found this way. There are multiple reasons for this, including 1) there are perhaps not that many stories that are actually findable this way, and 2) you need to put a LOT of work into data import and cleanup before you can run the query, and it's often better to put that work into manual exploration of promising leads.
Collaboration is a huge issue. You really want reporters around the world to share because connecting the dots is the name of the game, but there are big legal, security, privacy, and competitive problems with that. And it's damn near impossible to get reporters to upload their documents when they may or may not ever be used in the future, and only by someone else. I organized a conference in London a few years ago called Knowledge Management for Investigative Journalism and we came up with a really promising solution to this problem called DocHunt, which no one has yet implemented.
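
To make the UI point above concrete, here is a minimal sketch of "build the graph up piece by piece" as opposed to dumping every connection of a node. It is an illustration only: `FullGraph` is a hypothetical in-memory stand-in for a graph DB, `StoryGraph` is a made-up name for the reporter's working subgraph, and the entity names are invented.

```python
from collections import defaultdict


class FullGraph:
    """Hypothetical stand-in for a graph DB: holds every known edge."""

    def __init__(self, edges):
        self.adj = defaultdict(set)
        for a, b in edges:
            self.adj[a].add(b)
            self.adj[b].add(a)

    def neighbors(self, node):
        return sorted(self.adj.get(node, ()))


class StoryGraph:
    """Working subgraph: contains only what the reporter chose to keep."""

    def __init__(self, backend, seed):
        self.backend = backend
        self.nodes = {seed}
        self.edges = set()

    def candidates(self, node):
        """Neighbors of a kept node that are not yet part of the story."""
        if node not in self.nodes:
            raise KeyError(f"{node!r} is not in the story graph")
        return [n for n in self.backend.neighbors(node) if n not in self.nodes]

    def keep(self, node, neighbor):
        """Retain one edge (and its far node) that looks like 'a story'."""
        if node not in self.nodes:
            raise KeyError(f"{node!r} is not in the story graph")
        if neighbor not in self.backend.neighbors(node):
            raise ValueError(f"no known edge {node!r} - {neighbor!r}")
        self.nodes.add(neighbor)
        self.edges.add(tuple(sorted((node, neighbor))))


# Invented example data.
db = FullGraph([
    ("Acme Shipping", "J. Doe"),
    ("Acme Shipping", "Harbor Ltd"),
    ("Acme Shipping", "Registry Agent"),
    ("Harbor Ltd", "K. Roe"),
])
story = StoryGraph(db, seed="Acme Shipping")
print(story.candidates("Acme Shipping"))  # every option is shown...
story.keep("Acme Shipping", "Harbor Ltd")  # ...but only one is retained
print(story.candidates("Harbor Ltd"))
```

The design choice is that the full graph is only ever consulted for candidates; nothing enters the story graph without an explicit `keep`.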

I do have strong feelings about what the investigative journalism OSINT platform would look like. Basically, integrate a document store with a graph database and a UI that supports fuzzy search and entity/time/network analytics. It would look something like Jigsaw or New/s/leak, two research platforms that have very promising analysis features but are essentially useless in production work because most of the effort in this type of reporting is import and cleaning, not analysis. See also my rant on the difficulties of applying NLP to journalism -- it's the import, stupid. OCCRP's Aleph is probably the closest we have, and I think Pudo would be the first to admit it's nowhere near enough. And I bet Palantir would work great, but...

george · 2017-02-21 23:30:00 UTC

Great response @jonathanstray.

As it happens, tonight I'm meeting up with someone much more well-versed in these worlds/tools than I.

The more I explore, the more I'm interested in the possibility of building a newsroom's 'rolodex' than I am in building a bespoke investigation. (Though, as you outline, the potential is very much there as well.)

I feel like being able to exponentially grow a link/social network coupled with some of the more ethical aspects of social engineering could really add a lot of value in reporting power to a newsroom.

I'll be clicking through all your links to explore further.

Thanks.

pudo · 2017-02-22 14:13:19 UTC

Thanks for the shout-out. Two random comments:

1) Maltego - it's actually now been adopted by one of our senior reporters, and they're very very thrilled about it. Especially Paterva CaseFile, which is less focussed on the data enrichment aspect than the actual Maltego package.

2) Somehow, I found this OSINT package here quite inspiring: https://bitbucket.org/LaNMaSteR53/recon-ng - it's basically a whole bunch of sleuthing modules in social media that all write to a common core set of tables. While the code base itself is pretty horrific, I like the simplicity of the setup.
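
For illustration, the "modules that all write to a common core set of tables" idea can be sketched with SQLite from Python's standard library. This is not recon-ng's actual schema or code: the `contacts` table, its columns, and both module functions are hypothetical, and the modules return canned data instead of querying anything.

```python
import sqlite3


def open_workspace(path=":memory:"):
    """Create the shared table that every module writes to (assumed schema)."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS contacts (
               name TEXT NOT NULL,
               employer TEXT NOT NULL DEFAULT '',
               source_module TEXT NOT NULL,
               UNIQUE (name, employer)
           )"""
    )
    return db


def add_contact(db, name, employer, module):
    """Single write path; a contact already found by another module is skipped."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO contacts (name, employer, source_module) "
            "VALUES (?, ?, ?)",
            (name, employer, module),
        )


# Two hypothetical modules. Real ones would query a search engine, a
# company registry, etc.; these use canned names so the sketch runs offline.
def search_engine_module(db, company):
    for name in ("A. Reporter", "B. Source"):
        add_contact(db, name, company, "search_engine")


def registry_module(db, company):
    for name in ("B. Source", "C. Director"):
        add_contact(db, name, company, "registry")


db = open_workspace()
for module in (search_engine_module, registry_module):
    module(db, "Example Corp")

for row in db.execute("SELECT name, source_module FROM contacts ORDER BY name"):
    print(row)
```

Because every module goes through `add_contact`, adding a new data source means writing one small function, and the overlap between sources is deduplicated by the table's uniqueness constraint.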

So I've been using a similar approach for two of our recent investigations and extracted the approach into a very prototypical package: https://github.com/occrp/corpint . This basically reads a seed of entities from a CSV file and then goes out to Orbis, OpenCorporates, Aleph and Wikidata to find additional details. Finally, there's a merging layer which prompts the user to confirm valid matches.
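
A rough sketch of that seed, enrich, and confirm loop, in plain Python. This is not corpint's code or API: the CSV layout (a single `name` column), the `sources` dictionary of lookup functions, the similarity threshold, and the `confirm` callback are all assumptions made for illustration, and the example source returns canned records instead of calling Orbis, OpenCorporates, Aleph or Wikidata.

```python
import csv
import io
from difflib import SequenceMatcher


def read_seeds(csv_text):
    """Read seed entity names from CSV text with a 'name' column (assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["name"].strip() for row in reader if row["name"].strip()]


def similarity(a, b):
    """Crude case-insensitive name similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def enrich(seeds, sources, confirm, threshold=0.6):
    """Ask each source about each seed; keep candidates the user confirms.

    `sources` maps a source name to a function: seed name -> list of dicts
    with at least a 'name' key. `confirm(seed, candidate, score)` returns
    True to accept a match. Candidates below `threshold` are never shown.
    """
    matches = []
    for seed in seeds:
        for source_name, lookup in sources.items():
            for candidate in lookup(seed):
                score = similarity(seed, candidate["name"])
                if score >= threshold and confirm(seed, candidate, score):
                    matches.append((seed, source_name, candidate))
    return matches


def ask_on_terminal(seed, candidate, score):
    """Interactive confirmation, for use outside of automated runs."""
    answer = input(f"{seed!r} = {candidate['name']!r} ({score:.2f})? [y/N] ")
    return answer.strip().lower() == "y"


def fake_registry(name):
    """Hypothetical source returning canned records."""
    return [
        {"name": "ACME Shipping Limited", "id": "x1"},
        {"name": "Zenith Bank", "id": "x2"},
    ]


seeds = read_seeds("name\nAcme Shipping Ltd\n")
accepted = enrich(seeds, {"registry": fake_registry}, confirm=lambda s, c, sc: True)
print(accepted)
```

Passing `confirm=ask_on_terminal` instead of the always-yes lambda gives the human-in-the-loop merge step; the threshold only limits how many weak candidates the user has to look at.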

The resulting graph is loaded into Neo4j and explored using Linkurious. While it's too early to speak of massive investigative break-throughs, I like how quickly it lets me turn a list of 10 companies into a 1,000-node graph.

george · 2017-02-22 20:33:01 UTC

@pudo I'm looking at a use case that achieves something similar but with individuals. The end goal being to widen the 'attack' surface and avoid blind spots in reporting. It seems like Maltego should easily do this but the LinkedIn functionality seems limited to sending messages, not exploring links. recon-ng was also recently brought to my attention; I'll have to check it out.

thanks for the great responses, all.

george · 2017-02-22 21:15:12 UTC

It turns out recon-ng has a module called bing-linkedin-cache.py that seems to be vaguely what I'm after.

jonathanstray · 2017-03-01 02:46:50 UTC

This is a really interesting tool. The UI is horrifying, but there is a sort of genius underlying concept here, and I could express a lot of different methods with the grammar of this thing. No reporter could ever use it... and I'm not even sure I understand the "typical" reporting task well enough to know if this will really address the painful problems, though the emphasis on data collection and the many pre-fab searches are delightful.