DataShare
DataShare aims to let valuable knowledge about people and companies,
locked within hundreds of pages of documents on a computer, be sifted
into indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.
Current Features
An Extensible Multilingual Information Extraction and Search Platform
- Extract Text from Files;
- Extract Organizations, Persons and Locations from Text;
- Index and Search all
Multithreaded and Distributed Processing
Local or Remote Indexing
Installing and using
Using with Elasticsearch
You can download the script datashare.sh and execute it. It will:
- download the Redis, Elasticsearch and datashare Docker containers
- initialize an Elasticsearch index with the datashare mapping
- provide a CLI to run datashare extract, index and name finding tasks
- provide a web GUI to run datashare extract, index and name finding tasks, and to search in the documents
To access the web GUI, go to your documents folder and launch path/to/datashare.sh -w, then connect to datashare at http://localhost:8080
To avoid synchronizing NLP models (offline use), run export DS_JAVA_OPTS="-DDS_SYNC_NLP_MODELS=false" before launching the datashare.sh script.
Using only Named Entity Recognition
You can also use the datashare Docker container solely for its name finding API, exposed over HTTP.
Just run:
docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER -w
A bit of explanation:
- -w tells datashare to run the web server. It is launched on port 8080, which is why that port is mapped for Docker
- -m NER runs datashare without any index, in stateless mode
- -v /path/to/dist:/home/datashare/dist maps the directory where the NLP models will be read (and downloaded if they don't exist)
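The flags above can be parameterized when the models directory or the host port differs from the example. The following is only a sketch: the helper name ner_docker_cmd is hypothetical, while the image tag, flags and container paths are taken from the command above. It prints the command for review rather than running it. Only the host side of the port mapping changes, since the web server inside the container still listens on 8080.

```shell
# Hypothetical helper (not part of datashare): print the NER-only docker command.
# $1 = host directory holding the NLP models
# $2 = host port to expose (optional, defaults to 8080)
ner_docker_cmd() {
  models_dir="$1"
  host_port="${2:-8080}"
  # The container side of the mapping stays 8080, where -w starts the web server.
  printf 'docker run -ti -p %s:8080 -v %s:/home/datashare/dist icij/datashare:0.10 -m NER -w\n' \
    "$host_port" "$models_dir"
}

# Usage: print the command, review it, then run it yourself.
#   ner_docker_cmd /path/to/dist 9090
```

With a host port other than 8080, requests to the API go to that port on localhost instead.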
Then query the server with curl:
curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt
The last path segment (CORENLP) selects the NLP framework. It can be one of CORENLP, IXAPIPE, MITIE or OPENNLP.
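A small wrapper can catch a mistyped framework name before any request is sent. This is a minimal sketch: the function names ner_url and find_names and the DS_HOST variable are assumptions made for illustration, while the URL path and the four framework names come from the text above.

```shell
# Hypothetical helpers (not part of datashare).
# DS_HOST is an assumed variable naming the host and port of the running container.
DS_HOST="${DS_HOST:-localhost:8080}"

# Print the name finding URL for a framework, or fail on an unknown name.
ner_url() {
  case "$1" in
    CORENLP|IXAPIPE|MITIE|OPENNLP)
      printf '%s/ner/findNames/%s\n' "$DS_HOST" "$1"
      ;;
    *)
      echo "unknown framework: $1" >&2
      return 1
      ;;
  esac
}

# Send a text file to the name finding API.
# $1 = framework, $2 = path to a text file
find_names() {
  url="$(ner_url "$1")" || return 1
  curl -i "$url" --data-binary "@$2"
}

# Usage (requires the container to be running):
#   find_names MITIE path/to/a/file.txt
```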
Extract Text from Files
Implementations
- TikaDocument from ICIJ/extract: Apache Tika v1.18 (Apache Licence v2.0) with Tesseract v4.0 alpha
Support
Extract Persons, Organizations or Locations from Text
Implementations
- org.icij.datashare.text.nlp.corenlp.CorenlpPipeline: Stanford CoreNLP v3.8.0 (Conditional Random Fields), Composite GPL v3+
- org.icij.datashare.text.nlp.ixapipe.IxapipePipeline: Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence v2.0
- org.icij.datashare.text.nlp.mitie.MitiePipeline: MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0
- org.icij.datashare.text.nlp.opennlp.OpennlpPipeline: Apache OpenNLP v1.6.0 (Maximum Entropy), Apache Licence v2.0
Natural Language Processing Stages Support
| NlpStage |
|---|
| TOKEN |
| SENTENCE |
| POS |
| NER |
Named Entity Recognition Language Support
| NlpStage.NER | ENGLISH | SPANISH | GERMAN | FRENCH | CHINESE |
|---|---|---|---|---|---|
| NlpPipeline.Type.CORENLP | X | X | X | (w/ EN) | X |
| NlpPipeline.Type.OPENNLP | X | X | - | X | - |
| NlpPipeline.Type.IXAPIPE | X | X | X | - | - |
| NlpPipeline.Type.MITIE | X | X | X | - | - |
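The support table above can be transcribed into a lookup that a script checks before choosing a framework for a given language. This is a sketch only: ner_supported is a hypothetical helper, and it treats the "(w/ EN)" entry for CORENLP and French as supported, which is an assumption about what that entry means.

```shell
# Hypothetical helper (not part of datashare): succeed if the pipeline type
# is marked as supporting NER for the language in the table above.
# $1 = pipeline type (CORENLP, OPENNLP, IXAPIPE, MITIE), $2 = language
ner_supported() {
  case "$1:$2" in
    # Assumption: CORENLP's "(w/ EN)" entry for FRENCH counts as supported.
    CORENLP:ENGLISH|CORENLP:SPANISH|CORENLP:GERMAN|CORENLP:FRENCH|CORENLP:CHINESE) return 0 ;;
    OPENNLP:ENGLISH|OPENNLP:SPANISH|OPENNLP:FRENCH) return 0 ;;
    IXAPIPE:ENGLISH|IXAPIPE:SPANISH|IXAPIPE:GERMAN) return 0 ;;
    MITIE:ENGLISH|MITIE:SPANISH|MITIE:GERMAN) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if ner_supported OPENNLP GERMAN; then echo yes; else echo no; fi
```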
Named Entity Categories Support
| NamedEntity.Category |
|---|
| ORGANIZATION |
| PERSON |
| LOCATION |
Parts-of-Speech Language Support
| NlpStage.POS | ENGLISH | SPANISH | GERMAN | FRENCH |
|---|---|---|---|---|
| NlpPipeline.Type.CORENLP | X | X | X | X |
| NlpPipeline.Type.OPENNLP | X | X | X | X |
| NlpPipeline.Type.IXAPIPE | X | X | X | X |
| NlpPipeline.Type.MITIE | - | - | - | - |
Store and Search Documents and Named Entities
Implementations
- org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer: Elasticsearch v6.1.0, Apache Licence v2.0
Compilation / Build
From the datashare root directory, run: mvn package
License
DataShare is released under the GNU Affero General Public License
Feedback
We welcome feedback as well as contributions!
For any bug, question, comment or (pull) request,
please contact us at engineering@icij.org
What's next
- Data Sharing module
- Networking module
- Content Management module
- User Management module
- Request and Exchange Protocol