GitHunt

UNIC: Unicode and Internationalization Crates for Rust

UNIC-logo

Travis
Rust-1.45+
Unicode-10.0.0
Release
Crates.io
Documentation
Gitter

https://github.com/open-i18n/rust-unic

UNIC is a project to develop components for the Rust programming language
to provide high-quality and easy-to-use crates for Unicode
and Internationalization data and algorithms. In other words, it's like
ICU for Rust, written completely in Rust, mostly
in safe mode, but also benefiting from performance gains of unsafe mode when
possible.

See UNIC Changelog for latest release details.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and
Internationalization functionalities, starting from Unicode character
properties, to Unicode algorithms for processing text, and more advanced
(locale-based) processes based on Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented, as
needed by Unicode/CLDR components, or common demand.

Project Status

At the moment UNIC is under heavy development: the API is updated frequently on
master branch, and there will be API breakage between each 0.x release.
Please see open issues for changes
planed.

We expect to have the 1.0 version released in 2018 and maintain a stable API
afterwards, with possibly one or two API updates per year for the first couple
of years.

Design Goals

  1. Primary goal of UNIC is to provide reliable functionality by way of
    easy-to-use API. Therefore, new components are added may not be
    well-optimized for performance, but will have enough tests to show
    conformance to the standard, and examples to show users how they can be
    used to address common needs.

  2. Next major goal for UNIC components is performance and low binary and memory
    footprints. Specially, optimizing runtime for ASCII and other common cases
    will encourage adaptation without fear of slowing down regular development
    processes.

  3. Components are guaranteed, to the extend possible, to provide consistent
    data and algorithms. Cross-component tests are used to catch any
    inconsistency between implementations, without slowing down development
    processes.

Components and their Organization

UNIC Components have a hierarchical organization, starting from the
unic root, containing the major components. Each major component, in
turn, may host some minor components.

API of major components are designed for the end-users of the libraries, and
are expected to be extensively documented and accompanies with code examples.

In contrast to major components, minor components act as providers of data and
algorithms for the higher-level, and their API is expected to be more
performing, and possibly providing multiple ways of accessing the data.

The UNIC Super-Crate

The unic super-crate is a collection of all
UNIC (major) components, providing an easy way of access to all functionalities,
when all or many are needed, instead of importing components one-by-one. This
crate ensures all components imported are compatible in algorithms and
consistent data-wise.

Main code examples and cross-component integration tests are implemented under
this crate.

Major Components

Applications

Code Organization: Combined Repository

Some of the reasons to have a combined repository these components are:

  • Faster development. Implementing new Unicode/i18n components very often
    depends on other (lower level) components, which in turn may need
    adjustments—expose new API, fix bugs, etc—that can be developed, tested and
    reviewed in less cycles and shorter times.

  • Implementation Integrity. Multiple dependencies on other components
    mean that the components need to, to some level, agree with each other.
    Many Unicode algorithms, composed from smaller ones, assume that all parts
    of the algorithm is using the same version of Unicode data. Violation of
    this assumption can cause inconsistencies and hard-to-catch bugs. In a
    combined repository, it's possible to reach a better integrity during
    development, as well as with cross-component (integration) tests.

  • Pay for what you need. Small components (basic crates), which
    cross-depend only on what they need, allow users to only bring in what they
    consume in their project.

  • Shared bootstrapping. Considerable amount of extending Unicode/i18n
    functionalities depends on converting source Unicode/locale data into
    structured formats for the destination programming language. In a combined
    repository, it's easier to maintain these bootstrapping tools, expand
    coverage, and use better data structures for more efficiency.

Documentation

How to Use UNIC

In Cargo.toml:

[dependencies]
unic = "0.9.0"  # This has Unicode 10.0.0 data and algorithms

And in main.rs:

extern crate unic;

use unic::ucd::common::is_alphanumeric;
use unic::bidi::BidiInfo;
use unic::normal::StrNormalForm;
use unic::segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
use unic::ucd::normal::compose;
use unic::ucd::{is_cased, Age, BidiClass, CharAge, CharBidiClass, StrBidiClass, UnicodeVersion};

fn main() {

    // Age

    assert_eq!(Age::of('A').unwrap().actual(), UnicodeVersion { major: 1, minor: 1, micro: 0 });
    assert_eq!(Age::of('\u{A0000}'), None);
    assert_eq!(
        Age::of('\u{10FFFF}').unwrap().actual(),
        UnicodeVersion { major: 2, minor: 0, micro: 0 }
    );

    if let Some(age) = '🦊'.age() {
        assert_eq!(age.actual().major, 9);
        assert_eq!(age.actual().minor, 0);
        assert_eq!(age.actual().micro, 0);
    }

    // Bidi

    let text = concat![
        "א",
        "ב",
        "ג",
        "a",
        "b",
        "c",
    ];

    assert!(!text.has_bidi_explicit());
    assert!(text.has_rtl());
    assert!(text.has_ltr());

    assert_eq!(text.chars().nth(0).unwrap().bidi_class(), BidiClass::RightToLeft);
    assert!(!text.chars().nth(0).unwrap().is_ltr());
    assert!(text.chars().nth(0).unwrap().is_rtl());

    assert_eq!(text.chars().nth(3).unwrap().bidi_class(), BidiClass::LeftToRight);
    assert!(text.chars().nth(3).unwrap().is_ltr());
    assert!(!text.chars().nth(3).unwrap().is_rtl());

    let bidi_info = BidiInfo::new(text, None);
    assert_eq!(bidi_info.paragraphs.len(), 1);

    let para = &bidi_info.paragraphs[0];
    assert_eq!(para.level.number(), 1);
    assert_eq!(para.level.is_rtl(), true);

    let line = para.range.clone();
    let display = bidi_info.reorder_line(para, line);
    assert_eq!(
        display,
        concat![
            "a",
            "b",
            "c",
            "ג",
            "ב",
            "א",
        ]
    );

    // Case

    assert_eq!(is_cased('A'), true);
    assert_eq!(is_cased('א'), false);

    // Normalization

    assert_eq!(compose('A', '\u{030A}'), Some('Å'));

    let s = "ÅΩ";
    let c = s.nfc().collect::<String>();
    assert_eq!(c, "ÅΩ");

    // Segmentation

    assert_eq!(
        Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
        &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
    );

    assert_eq!(
        Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
        &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
    );

    assert_eq!(
        GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
        &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
    );

    assert_eq!(
        Words::new(
            "The quick (\"brown\") fox can't jump 32.3 feet, right?",
            |s: &&str| s.chars().any(is_alphanumeric),
        ).collect::<Vec<&str>>(),
        &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
    );

    assert_eq!(
        WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
        &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
    );

    assert_eq!(
        WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
        &[
            (0, "Brr"),
            (3, ","),
            (4, " "),
            (5, "it's"),
            (9, " "),
            (10, "29.3"),
            (14, "°"),
            (16, "F"),
            (17, "!")
        ]
    );
}

You can find more examples under examples and tests
directories. (And more to be added as UNIC expands...)

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.

Code of Conduct

UNIC project follows The Rust Code of Conduct. You can find a copy of it in
CODE_OF_CONDUCT.md or online at
https://www.rust-lang.org/conduct.html.

open-i18n/rust-unic | GitHunt