Switch from Godep to go vendoring

parent 6b37713bc0
commit cd317761c5

1504 changed files with 263076 additions and 34441 deletions
202  vendor/github.com/blevesearch/segment/LICENSE  (generated, vendored, Normal file)

@@ -0,0 +1,202 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
92  vendor/github.com/blevesearch/segment/README.md  (generated, vendored, Normal file)

@@ -0,0 +1,92 @@

# segment

A Go library for performing Unicode Text Segmentation
as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/)

## Features

* Currently only segmentation at Word Boundaries is supported.

## License

Apache License Version 2.0

## Usage

The functionality is exposed in two ways:

1. You can use a bufio.Scanner with the SplitWords implementation of SplitFunc.
The SplitWords function will identify the appropriate word boundaries in the input
text and the Scanner will return tokens at the appropriate place.

		scanner := bufio.NewScanner(...)
		scanner.Split(segment.SplitWords)
		for scanner.Scan() {
			tokenBytes := scanner.Bytes()
		}
		if err := scanner.Err(); err != nil {
			t.Fatal(err)
		}

2. Sometimes you would also like information returned about the type of token.
To do this we have introduced a new type named Segmenter. It works just like Scanner
but additionally a token type is returned.

		segmenter := segment.NewWordSegmenter(...)
		for segmenter.Segment() {
			tokenBytes := segmenter.Bytes()
			tokenType := segmenter.Type()
		}
		if err := segmenter.Err(); err != nil {
			t.Fatal(err)
		}

## Choosing Implementation

By default segment does NOT use the fastest runtime implementation. The reason is that it adds approximately 5s to compilation time and may require more than 1GB of RAM on the machine performing compilation.

However, you can choose to build with the fastest runtime implementation by passing the build tag as follows:

	-tags 'prod'

## Generating Code

Several components in this package are generated.

1. Several Ragel rules files are generated from Unicode properties files.
2. The Ragel machine is generated from the Ragel rules.
3. Test tables are generated from the Unicode test files.

All of these can be generated by running:

	go generate

## Fuzzing

There is support for fuzzing the segment library with [go-fuzz](https://github.com/dvyukov/go-fuzz).

1. Install go-fuzz if you haven't already:

	go get github.com/dvyukov/go-fuzz/go-fuzz
	go get github.com/dvyukov/go-fuzz/go-fuzz-build

2. Build the package with go-fuzz:

	go-fuzz-build github.com/blevesearch/segment

3. Convert the Unicode provided test cases into the initial corpus for go-fuzz:

	go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate

4. Run go-fuzz:

	go-fuzz -bin=segment-fuzz.zip -workdir=workdir

## Status

[Build Status](https://travis-ci.org/blevesearch/segment)

[Coverage Status](https://coveralls.io/r/blevesearch/segment?branch=master)

[GoDoc](https://godoc.org/github.com/blevesearch/segment)
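The README's two snippets above assume a surrounding test (hence t.Fatal). For reference, here is a minimal, self-contained program built only from the SplitWords API shown in this diff; the input string is illustrative:

    package main

    import (
    	"bufio"
    	"fmt"
    	"strings"

    	"github.com/blevesearch/segment"
    )

    func main() {
    	// Any io.Reader works; a string reader keeps the sketch self-contained.
    	scanner := bufio.NewScanner(strings.NewReader("Hello, 世界!"))
    	// Replace the default line splitter with word-boundary segmentation.
    	scanner.Split(segment.SplitWords)
    	for scanner.Scan() {
    		fmt.Printf("token: %q\n", scanner.Bytes())
    	}
    	if err := scanner.Err(); err != nil {
    		fmt.Println("error:", err)
    	}
    }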
45  vendor/github.com/blevesearch/segment/doc.go  (generated, vendored, Normal file)

@@ -0,0 +1,45 @@

// Copyright (c) 2014 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

/*
Package segment is a library for performing Unicode Text Segmentation
as described in Unicode Standard Annex #29 http://www.unicode.org/reports/tr29/

Currently only segmentation at Word Boundaries is supported.

The functionality is exposed in two ways:

1. You can use a bufio.Scanner with the SplitWords implementation of SplitFunc.
The SplitWords function will identify the appropriate word boundaries in the input
text and the Scanner will return tokens at the appropriate place.

	scanner := bufio.NewScanner(...)
	scanner.Split(segment.SplitWords)
	for scanner.Scan() {
		tokenBytes := scanner.Bytes()
	}
	if err := scanner.Err(); err != nil {
		t.Fatal(err)
	}

2. Sometimes you would also like information returned about the type of token.
To do this we have introduced a new type named Segmenter. It works just like Scanner
but additionally a token type is returned.

	segmenter := segment.NewWordSegmenter(...)
	for segmenter.Segment() {
		tokenBytes := segmenter.Bytes()
		tokenType := segmenter.Type()
	}
	if err := segmenter.Err(); err != nil {
		t.Fatal(err)
	}

*/
package segment
20  vendor/github.com/blevesearch/segment/export_test.go  (generated, vendored, Normal file)

@@ -0,0 +1,20 @@

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package segment

// Exported for testing only.
import (
	"unicode/utf8"
)

func (s *Segmenter) MaxTokenSize(n int) {
	if n < utf8.UTFMax || n > 1e9 {
		panic("bad max token size")
	}
	if n < len(s.buf) {
		s.buf = make([]byte, n)
	}
	s.maxTokenSize = n
}
219  vendor/github.com/blevesearch/segment/maketesttables.go  (generated, vendored, Normal file)

@@ -0,0 +1,219 @@

// Copyright (c) 2015 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

// +build ignore

package main

import (
	"bufio"
	"bytes"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"unicode"
)

var url = flag.String("url",
	"http://www.unicode.org/Public/"+unicode.Version+"/ucd/auxiliary/",
	"URL of Unicode database directory")
var verbose = flag.Bool("verbose",
	false,
	"write data to stdout as it is parsed")
var localFiles = flag.Bool("local",
	false,
	"data files have been copied to the current directory; for debugging only")

var outputFile = flag.String("output",
	"",
	"output file for generated tables; default stdout")

var output *bufio.Writer

func main() {
	flag.Parse()
	setupOutput()

	graphemeTests := make([]test, 0)
	graphemeComments := make([]string, 0)
	graphemeTests, graphemeComments = loadUnicodeData("GraphemeBreakTest.txt", graphemeTests, graphemeComments)
	wordTests := make([]test, 0)
	wordComments := make([]string, 0)
	wordTests, wordComments = loadUnicodeData("WordBreakTest.txt", wordTests, wordComments)
	sentenceTests := make([]test, 0)
	sentenceComments := make([]string, 0)
	sentenceTests, sentenceComments = loadUnicodeData("SentenceBreakTest.txt", sentenceTests, sentenceComments)

	fmt.Fprintf(output, fileHeader, *url)
	generateTestTables("Grapheme", graphemeTests, graphemeComments)
	generateTestTables("Word", wordTests, wordComments)
	generateTestTables("Sentence", sentenceTests, sentenceComments)

	flushOutput()
}

// WordBreakProperty.txt has the form:
// 05F0..05F2    ; Hebrew_Letter # Lo   [3] HEBREW LIGATURE YIDDISH DOUBLE VAV..HEBREW LIGATURE YIDDISH DOUBLE YOD
// FB1D          ; Hebrew_Letter # Lo       HEBREW LETTER YOD WITH HIRIQ
func openReader(file string) (input io.ReadCloser) {
	if *localFiles {
		f, err := os.Open(file)
		if err != nil {
			log.Fatal(err)
		}
		input = f
	} else {
		path := *url + file
		resp, err := http.Get(path)
		if err != nil {
			log.Fatal(err)
		}
		if resp.StatusCode != 200 {
			log.Fatal("bad GET status for "+file, resp.Status)
		}
		input = resp.Body
	}
	return
}

func loadUnicodeData(filename string, tests []test, comments []string) ([]test, []string) {
	f := openReader(filename)
	defer f.Close()
	bufioReader := bufio.NewReader(f)
	line, err := bufioReader.ReadString('\n')
	for err == nil {
		tests, comments = parseLine(line, tests, comments)
		line, err = bufioReader.ReadString('\n')
	}
	// if the err was EOF still need to process last value
	if err == io.EOF {
		tests, comments = parseLine(line, tests, comments)
	}
	return tests, comments
}

const comment = "#"
const brk = "÷"
const nbrk = "×"

type test [][]byte

func parseLine(line string, tests []test, comments []string) ([]test, []string) {
	if strings.HasPrefix(line, comment) {
		return tests, comments
	}
	line = strings.TrimSpace(line)
	if len(line) == 0 {
		return tests, comments
	}
	commentStart := strings.Index(line, comment)
	comment := strings.TrimSpace(line[commentStart+1:])
	if commentStart > 0 {
		line = line[0:commentStart]
	}
	pieces := strings.Split(line, brk)
	t := make(test, 0)
	for _, piece := range pieces {
		piece = strings.TrimSpace(piece)
		if len(piece) > 0 {
			codePoints := strings.Split(piece, nbrk)
			word := ""
			for _, codePoint := range codePoints {
				codePoint = strings.TrimSpace(codePoint)
				r, err := strconv.ParseInt(codePoint, 16, 64)
				if err != nil {
					log.Printf("err: %v for '%s'", err, string(r))
					return tests, comments
				}

				word += string(r)
			}
			t = append(t, []byte(word))
		}
	}
	tests = append(tests, t)
	comments = append(comments, comment)
	return tests, comments
}

func generateTestTables(prefix string, tests []test, comments []string) {
	fmt.Fprintf(output, testHeader, prefix)
	for i, t := range tests {
		fmt.Fprintf(output, "\t\t{\n")
		fmt.Fprintf(output, "\t\t\tinput: %#v,\n", bytes.Join(t, []byte{}))
		fmt.Fprintf(output, "\t\t\toutput: %s,\n", generateTest(t))
		fmt.Fprintf(output, "\t\t\tcomment: `%s`,\n", comments[i])
		fmt.Fprintf(output, "\t\t},\n")
	}
	fmt.Fprintf(output, "}\n")
}

func generateTest(t test) string {
	rv := "[][]byte{"
	for _, te := range t {
		rv += fmt.Sprintf("%#v,", te)
	}
	rv += "}"
	return rv
}

const fileHeader = `// Generated by running
// maketesttables --url=%s
// DO NOT EDIT

package segment
`

const testHeader = `var unicode%sTests = []struct {
	input   []byte
	output  [][]byte
	comment string
}{
`

func setupOutput() {
	output = bufio.NewWriter(startGofmt())
}

// startGofmt connects output to a gofmt process if -output is set.
func startGofmt() io.Writer {
	if *outputFile == "" {
		return os.Stdout
	}
	stdout, err := os.Create(*outputFile)
	if err != nil {
		log.Fatal(err)
	}
	// Pipe output to gofmt.
	gofmt := exec.Command("gofmt")
	fd, err := gofmt.StdinPipe()
	if err != nil {
		log.Fatal(err)
	}
	gofmt.Stdout = stdout
	gofmt.Stderr = os.Stderr
	err = gofmt.Start()
	if err != nil {
		log.Fatal(err)
	}
	return fd
}

func flushOutput() {
	err := output.Flush()
	if err != nil {
		log.Fatal(err)
	}
}
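parseLine above consumes the Unicode break-test file format, where ÷ marks a break opportunity and × marks none. A hedged, standalone sketch of that interpretation follows; the sample line mimics the Unicode test-file convention and is not taken from this repository:

    package main

    import (
    	"fmt"
    	"strconv"
    	"strings"
    )

    func main() {
    	// Illustrative line in WordBreakTest.txt style:
    	// break before U+0001, no break before U+0308, break before U+0020.
    	line := "÷ 0001 × 0308 ÷ 0020 ÷"
    	var segments [][]rune
    	for _, piece := range strings.Split(line, "÷") {
    		piece = strings.TrimSpace(piece)
    		if piece == "" {
    			continue // leading/trailing ÷ produce empty pieces
    		}
    		var word []rune
    		for _, cp := range strings.Split(piece, "×") {
    			n, err := strconv.ParseInt(strings.TrimSpace(cp), 16, 32)
    			if err != nil {
    				continue
    			}
    			word = append(word, rune(n))
    		}
    		segments = append(segments, word)
    	}
    	fmt.Printf("%U\n", segments) // [[U+0001 U+0308] [U+0020]]
    }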
330  vendor/github.com/blevesearch/segment/ragel/unicode2ragel.rb  (generated, vendored, Normal file)

@@ -0,0 +1,330 @@

#!/usr/bin/env ruby
#
# This script has been updated to accept more command-line arguments:
#
#    -u, --url         URL to process
#    -m, --machine     Machine name
#    -p, --properties  Properties to add to the machine
#    -o, --output      Write output to file
#
# Updated by: Marty Schoch <marty.schoch@gmail.com>
#
# This script uses the unicode spec to generate a Ragel state machine
# that recognizes unicode alphanumeric characters. It generates 5
# character classes: uupper, ulower, ualpha, udigit, and ualnum.
# Currently supported encodings are UTF-8 [default] and UCS-4.
#
# Usage: unicode2ragel.rb [options]
#    -e, --encoding [ucs4 | utf8]     Data encoding
#    -h, --help                       Show this message
#
# This script was originally written as part of the Ferret search
# engine library.
#
# Author: Rakan El-Khalil <rakan@well.com>

require 'optparse'
require 'open-uri'

ENCODINGS = [ :utf8, :ucs4 ]
ALPHTYPES = { :utf8 => "unsigned char", :ucs4 => "unsigned int" }
DEFAULT_CHART_URL = "http://www.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt"
DEFAULT_MACHINE_NAME= "WChar"

###
# Display vars & default option

TOTAL_WIDTH = 80
RANGE_WIDTH = 23
@encoding = :utf8
@chart_url = DEFAULT_CHART_URL
machine_name = DEFAULT_MACHINE_NAME
properties = []
@output = $stdout

###
# Option parsing

cli_opts = OptionParser.new do |opts|
  opts.on("-e", "--encoding [ucs4 | utf8]", "Data encoding") do |o|
    @encoding = o.downcase.to_sym
  end
  opts.on("-h", "--help", "Show this message") do
    puts opts
    exit
  end
  opts.on("-u", "--url URL", "URL to process") do |o|
    @chart_url = o
  end
  opts.on("-m", "--machine MACHINE_NAME", "Machine name") do |o|
    machine_name = o
  end
  opts.on("-p", "--properties x,y,z", Array, "Properties to add to machine") do |o|
    properties = o
  end
  opts.on("-o", "--output FILE", "output file") do |o|
    @output = File.new(o, "w+")
  end
end

cli_opts.parse(ARGV)
unless ENCODINGS.member? @encoding
  puts "Invalid encoding: #{@encoding}"
  puts cli_opts
  exit
end

##
# Downloads the document at url and yields every alpha line's hex
# range and description.

def each_alpha( url, property )
  open( url ) do |file|
    file.each_line do |line|
      next if line =~ /^#/;
      next if line !~ /; #{property} #/;

      range, description = line.split(/;/)
      range.strip!
      description.gsub!(/.*#/, '').strip!

      if range =~ /\.\./
           start, stop = range.split '..'
      else start = stop = range
      end

      yield start.hex .. stop.hex, description
    end
  end
end

###
# Formats to hex at minimum width

def to_hex( n )
  r = "%0X" % n
  r = "0#{r}" unless (r.length % 2).zero?
  r
end

###
# UCS4 is just a straight hex conversion of the unicode codepoint.

def to_ucs4( range )
  rangestr  = "0x" + to_hex(range.begin)
  rangestr << "..0x" + to_hex(range.end) if range.begin != range.end
  [ rangestr ]
end

##
#  0x00     - 0x7f     -> 0zzzzzzz[7]
#  0x80     - 0x7ff    -> 110yyyyy[5] 10zzzzzz[6]
#  0x800    - 0xffff   -> 1110xxxx[4] 10yyyyyy[6] 10zzzzzz[6]
#  0x010000 - 0x10ffff -> 11110www[3] 10xxxxxx[6] 10yyyyyy[6] 10zzzzzz[6]

UTF8_BOUNDARIES = [0x7f, 0x7ff, 0xffff, 0x10ffff]

def to_utf8_enc( n )
  r = 0
  if n <= 0x7f
    r = n
  elsif n <= 0x7ff
    y = 0xc0 | (n >> 6)
    z = 0x80 | (n & 0x3f)
    r = y << 8 | z
  elsif n <= 0xffff
    x = 0xe0 | (n >> 12)
    y = 0x80 | (n >> 6) & 0x3f
    z = 0x80 | n & 0x3f
    r = x << 16 | y << 8 | z
  elsif n <= 0x10ffff
    w = 0xf0 | (n >> 18)
    x = 0x80 | (n >> 12) & 0x3f
    y = 0x80 | (n >> 6) & 0x3f
    z = 0x80 | n & 0x3f
    r = w << 24 | x << 16 | y << 8 | z
  end

  to_hex(r)
end

def from_utf8_enc( n )
  n = n.hex
  r = 0
  if n <= 0x7f
    r = n
  elsif n <= 0xdfff
    y = (n >> 8) & 0x1f
    z = n & 0x3f
    r = y << 6 | z
  elsif n <= 0xefffff
    x = (n >> 16) & 0x0f
    y = (n >> 8) & 0x3f
    z = n & 0x3f
    r = x << 10 | y << 6 | z
  elsif n <= 0xf7ffffff
    w = (n >> 24) & 0x07
    x = (n >> 16) & 0x3f
    y = (n >> 8) & 0x3f
    z = n & 0x3f
    r = w << 18 | x << 12 | y << 6 | z
  end
  r
end

###
# Given a range, splits it up into ranges that can be continuously
# encoded into utf8. Eg: 0x00 .. 0xff => [0x00..0x7f, 0x80..0xff]
# This is not strictly needed since the current [5.1] unicode standard
# doesn't have ranges that straddle utf8 boundaries. This is included
# for completeness as there is no telling if that will ever change.

def utf8_ranges( range )
  ranges = []
  UTF8_BOUNDARIES.each do |max|
    if range.begin <= max
      return ranges << range if range.end <= max

      ranges << (range.begin .. max)
      range = (max + 1) .. range.end
    end
  end
  ranges
end

def build_range( start, stop )
  size = start.size/2
  left = size - 1
  return [""] if size < 1

  a = start[0..1]
  b = stop[0..1]

  ###
  # Shared prefix

  if a == b
    return build_range(start[2..-1], stop[2..-1]).map do |elt|
      "0x#{a} " + elt
    end
  end

  ###
  # Unshared prefix, end of run

  return ["0x#{a}..0x#{b} "] if left.zero?

  ###
  # Unshared prefix, not end of run
  # Range can be 0x123456..0x56789A
  # Which is equivalent to:
  #     0x123456 .. 0x12FFFF
  #     0x130000 .. 0x55FFFF
  #     0x560000 .. 0x56789A

  ret = []
  ret << build_range(start, a + "FF" * left)

  ###
  # Only generate middle range if need be.

  if a.hex+1 != b.hex
    max = to_hex(b.hex - 1)
    max = "FF" if b == "FF"
    ret << "0x#{to_hex(a.hex+1)}..0x#{max} " + "0x00..0xFF " * left
  end

  ###
  # Don't generate last range if it is covered by first range

  ret << build_range(b + "00" * left, stop) unless b == "FF"
  ret.flatten!
end

def to_utf8( range )
  utf8_ranges( range ).map do |r|
    build_range to_utf8_enc(r.begin), to_utf8_enc(r.end)
  end.flatten!
end

##
# Perform a 3-way comparison of the number of codepoints advertised by
# the unicode spec for the given range, the originally parsed range,
# and the resulting utf8 encoded range.

def count_codepoints( code )
  code.split(' ').inject(1) do |acc, elt|
    if elt =~ /0x(.+)\.\.0x(.+)/
      if @encoding == :utf8
        acc * (from_utf8_enc($2) - from_utf8_enc($1) + 1)
      else
        acc * ($2.hex - $1.hex + 1)
      end
    else
      acc
    end
  end
end

def is_valid?( range, desc, codes )
  spec_count  = 1
  spec_count  = $1.to_i if desc =~ /\[(\d+)\]/
  range_count = range.end - range.begin + 1

  sum = codes.inject(0) { |acc, elt| acc + count_codepoints(elt) }
  sum == spec_count and sum == range_count
end

##
# Generate the state machine to stdout

def generate_machine( name, property )
  pipe = " "
  @output.puts "    #{name} = "
  each_alpha( @chart_url, property ) do |range, desc|

    codes = (@encoding == :ucs4) ? to_ucs4(range) : to_utf8(range)

    raise "Invalid encoding of range #{range}: #{codes.inspect}" unless
      is_valid? range, desc, codes

    range_width = codes.map { |a| a.size }.max
    range_width = RANGE_WIDTH if range_width < RANGE_WIDTH

    desc_width  = TOTAL_WIDTH - RANGE_WIDTH - 11
    desc_width -= (range_width - RANGE_WIDTH) if range_width > RANGE_WIDTH

    if desc.size > desc_width
      desc = desc[0..desc_width - 4] + "..."
    end

    codes.each_with_index do |r, idx|
      desc = "" unless idx.zero?
      code = "%-#{range_width}s" % r
      @output.puts "      #{pipe} #{code} ##{desc}"
      pipe = "|"
    end
  end
  @output.puts "      ;"
  @output.puts ""
end

@output.puts <<EOF
# The following Ragel file was autogenerated with #{$0}
# from: #{@chart_url}
#
# It defines #{properties}.
#
# To use this, make sure that your alphtype is set to #{ALPHTYPES[@encoding]},
# and that your input is in #{@encoding}.

%%{
    machine #{machine_name};

EOF

properties.each { |x| generate_machine( x, x ) }

@output.puts <<EOF
}%%
EOF
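to_utf8_enc above packs the UTF-8 bytes of a codepoint into one integer before hex-formatting it. To make the bit-packing concrete, here is a small Go transcription of just the 2-byte case (the function name is illustrative, not from this repository):

    package main

    import "fmt"

    // toUTF8Enc mirrors the script's to_utf8_enc for 2-byte codepoints:
    // it packs the lead byte (110yyyyy) and continuation byte (10zzzzzz)
    // into a single integer, so 0x7FF becomes 0xDFBF, i.e. bytes DF BF.
    func toUTF8Enc(n int) int {
    	if n <= 0x7f {
    		return n
    	}
    	y := 0xc0 | (n >> 6)
    	z := 0x80 | (n & 0x3f)
    	return y<<8 | z
    }

    func main() {
    	fmt.Printf("%X\n", toUTF8Enc(0x7FF)) // DFBF
    }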
101  vendor/github.com/blevesearch/segment/ragel/uscript.rl  (generated, vendored, Normal file)

@@ -0,0 +1,101 @@

# The following Ragel file was autogenerated with ragel/unicode2ragel.rb
# from: http://www.unicode.org/Public/8.0.0/ucd/Scripts.txt
#
# It defines ["Hangul", "Han", "Hiragana"].
#
# To use this, make sure that your alphtype is set to unsigned char,
# and that your input is in utf8.

%%{
    machine SCRIPTS;

    Hangul =
        0xE1 0x84 0x80..0xFF               #Lo  [256] HANGUL CHOSEONG KIYEOK..HANGUL...
      | 0xE1 0x85..0x86 0x00..0xFF         #
      | 0xE1 0x87 0x00..0xBF               #
      | 0xE3 0x80 0xAE..0xAF               #Mc    [2] HANGUL SINGLE DOT TONE MARK..HANGU...
      | 0xE3 0x84 0xB1..0xFF               #Lo   [94] HANGUL LETTER KIYEOK..HANGUL L...
      | 0xE3 0x85..0x85 0x00..0xFF         #
      | 0xE3 0x86 0x00..0x8E               #
      | 0xE3 0x88 0x80..0x9E               #So   [31] PARENTHESIZED HANGUL KIYEOK..PAREN...
      | 0xE3 0x89 0xA0..0xBE               #So   [31] CIRCLED HANGUL KIYEOK..CIRCLED HAN...
      | 0xEA 0xA5 0xA0..0xBC               #Lo   [29] HANGUL CHOSEONG TIKEUT-MIEUM..HANG...
      | 0xEA 0xB0 0x80..0xFF               #Lo [11172] HANGUL SYLLABLE GA..HA...
      | 0xEA 0xB1..0xFF 0x00..0xFF         #
      | 0xEB..0xEC 0x00..0xFF 0x00..0xFF   #
      | 0xED 0x00 0x00..0xFF               #
      | 0xED 0x01..0x9D 0x00..0xFF         #
      | 0xED 0x9E 0x00..0xA3               #
      | 0xED 0x9E 0xB0..0xFF               #Lo   [23] HANGUL JUNGSEONG O-YEO..HANGUL JUN...
      | 0xED 0x9F 0x00..0x86               #
      | 0xED 0x9F 0x8B..0xBB               #Lo   [49] HANGUL JONGSEONG NIEUN-RIEUL..HANG...
      | 0xEF 0xBE 0xA0..0xBE               #Lo   [31] HALFWIDTH HANGUL FILLER..HALFWIDTH...
      | 0xEF 0xBF 0x82..0x87               #Lo    [6] HALFWIDTH HANGUL LETTER A..HALFWID...
      | 0xEF 0xBF 0x8A..0x8F               #Lo    [6] HALFWIDTH HANGUL LETTER YEO..HALFW...
      | 0xEF 0xBF 0x92..0x97               #Lo    [6] HALFWIDTH HANGUL LETTER YO..HALFWI...
      | 0xEF 0xBF 0x9A..0x9C               #Lo    [3] HALFWIDTH HANGUL LETTER EU..HALFWI...
      ;

    Han =
        0xE2 0xBA 0x80..0x99                 #So   [26] CJK RADICAL REPEAT..CJK RADICAL RAP
      | 0xE2 0xBA 0x9B..0xFF                 #So   [89] CJK RADICAL CHOKE..CJK RADICAL C-S...
      | 0xE2 0xBB 0x00..0xB3                 #
      | 0xE2 0xBC 0x80..0xFF                 #So  [214] KANGXI RADICAL ONE..KANGXI RAD...
      | 0xE2 0xBD..0xBE 0x00..0xFF           #
      | 0xE2 0xBF 0x00..0x95                 #
      | 0xE3 0x80 0x85                       #Lm       IDEOGRAPHIC ITERATION MARK
      | 0xE3 0x80 0x87                       #Nl       IDEOGRAPHIC NUMBER ZERO
      | 0xE3 0x80 0xA1..0xA9                 #Nl    [9] HANGZHOU NUMERAL ONE..HANGZHOU NUM...
      | 0xE3 0x80 0xB8..0xBA                 #Nl    [3] HANGZHOU NUMERAL TEN..HANGZHOU NUM...
      | 0xE3 0x80 0xBB                       #Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
      | 0xE3 0x90 0x80..0xFF                 #Lo [6582] CJK UNIFIED IDEOGRAPH-3400..C...
      | 0xE3 0x91..0xFF 0x00..0xFF           #
      | 0xE4 0x00 0x00..0xFF                 #
      | 0xE4 0x01..0xB5 0x00..0xFF           #
      | 0xE4 0xB6 0x00..0xB5                 #
      | 0xE4 0xB8 0x80..0xFF                 #Lo [20950] CJK UNIFIED IDEOGRAPH-...
      | 0xE4 0xB9..0xFF 0x00..0xFF           #
      | 0xE5..0xE8 0x00..0xFF 0x00..0xFF     #
      | 0xE9 0x00 0x00..0xFF                 #
      | 0xE9 0x01..0xBE 0x00..0xFF           #
      | 0xE9 0xBF 0x00..0x95                 #
      | 0xEF 0xA4 0x80..0xFF                 #Lo  [366] CJK COMPATIBILITY IDEOGRAPH-F9...
      | 0xEF 0xA5..0xA8 0x00..0xFF           #
      | 0xEF 0xA9 0x00..0xAD                 #
      | 0xEF 0xA9 0xB0..0xFF                 #Lo  [106] CJK COMPATIBILITY IDEOGRAPH-FA...
      | 0xEF 0xAA..0xAA 0x00..0xFF           #
      | 0xEF 0xAB 0x00..0x99                 #
      | 0xF0 0xA0 0x80 0x80..0xFF            #Lo [42711] CJK UNIFIED IDEOG...
      | 0xF0 0xA0 0x81..0xFF 0x00..0xFF      #
      | 0xF0 0xA1..0xA9 0x00..0xFF 0x00..0xFF #
      | 0xF0 0xAA 0x00 0x00..0xFF            #
      | 0xF0 0xAA 0x01..0x9A 0x00..0xFF      #
      | 0xF0 0xAA 0x9B 0x00..0x96            #
      | 0xF0 0xAA 0x9C 0x80..0xFF            #Lo [4149] CJK UNIFIED IDEOGRAPH-2A...
      | 0xF0 0xAA 0x9D..0xFF 0x00..0xFF      #
      | 0xF0 0xAB 0x00 0x00..0xFF            #
      | 0xF0 0xAB 0x01..0x9B 0x00..0xFF      #
      | 0xF0 0xAB 0x9C 0x00..0xB4            #
      | 0xF0 0xAB 0x9D 0x80..0xFF            #Lo  [222] CJK UNIFIED IDEOGRAPH-2B7...
      | 0xF0 0xAB 0x9E..0x9F 0x00..0xFF      #
      | 0xF0 0xAB 0xA0 0x00..0x9D            #
      | 0xF0 0xAB 0xA0 0xA0..0xFF            #Lo [5762] CJK UNIFIED IDEOGRAPH-2B...
      | 0xF0 0xAB 0xA1..0xFF 0x00..0xFF      #
      | 0xF0 0xAC 0x00 0x00..0xFF            #
      | 0xF0 0xAC 0x01..0xB9 0x00..0xFF      #
      | 0xF0 0xAC 0xBA 0x00..0xA1            #
      | 0xF0 0xAF 0xA0 0x80..0xFF            #Lo  [542] CJK COMPATIBILITY IDEOGRA...
      | 0xF0 0xAF 0xA1..0xA7 0x00..0xFF      #
      | 0xF0 0xAF 0xA8 0x00..0x9D            #
      ;

    Hiragana =
        0xE3 0x81 0x81..0xFF     #Lo   [86] HIRAGANA LETTER SMALL A..HIRAGANA ...
      | 0xE3 0x82 0x00..0x96     #
      | 0xE3 0x82 0x9D..0x9E     #Lm    [2] HIRAGANA ITERATION MARK..HIRAGANA ...
      | 0xE3 0x82 0x9F           #Lo       HIRAGANA DIGRAPH YORI
      | 0xF0 0x9B 0x80 0x81      #Lo       HIRAGANA LETTER ARCHAIC YE
      | 0xF0 0x9F 0x88 0x80      #So       SQUARE HIRAGANA HOKA
      ;

}%%
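These Ragel rules match raw UTF-8 bytes rather than codepoints. As a sanity check of the large Hangul syllable range above (which begins at the sequence 0xEA 0xB0 0x80), Go's standard encoder can confirm the bytes; this standalone snippet is illustrative and not part of the vendored code:

    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	// U+AC00 (HANGUL SYLLABLE GA) should encode to EA B0 80, matching
    	// the first sequence of the Hangul syllable rule in uscript.rl.
    	buf := make([]byte, 4)
    	n := utf8.EncodeRune(buf, '가') // U+AC00
    	fmt.Printf("% X\n", buf[:n])    // EA B0 80
    }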
1290  vendor/github.com/blevesearch/segment/ragel/uwb.rl  (generated, vendored, Normal file)

File diff suppressed because it is too large.
284  vendor/github.com/blevesearch/segment/segment.go  (generated, vendored, Normal file)

@@ -0,0 +1,284 @@

// Copyright (c) 2015 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

package segment

import (
	"errors"
	"io"
)

// Autogenerate the following:
// 1. Ragel rules from subset of Unicode script properties
// 2. Ragel rules from Unicode word segmentation properties
// 3. Ragel machine for word segmentation
// 4. Test tables from Unicode
//
// Requires:
// 1. Ruby (to generate ragel rules from unicode spec)
// 2. Ragel (only v6.9 tested)
// 3. sed (to rewrite build tags)
//
//go:generate ragel/unicode2ragel.rb -u http://www.unicode.org/Public/8.0.0/ucd/Scripts.txt -m SCRIPTS -p Hangul,Han,Hiragana -o ragel/uscript.rl
//go:generate ragel/unicode2ragel.rb -u http://www.unicode.org/Public/8.0.0/ucd/auxiliary/WordBreakProperty.txt -m WB -p Double_Quote,Single_Quote,Hebrew_Letter,CR,LF,Newline,Extend,Format,Katakana,ALetter,MidLetter,MidNum,MidNumLet,Numeric,ExtendNumLet,Regional_Indicator -o ragel/uwb.rl
//go:generate ragel -T1 -Z segment_words.rl -o segment_words.go
//go:generate sed -i "" -e "s/BUILDTAGS/!prod/" segment_words.go
//go:generate sed -i "" -e "s/RAGELFLAGS/-T1/" segment_words.go
//go:generate ragel -G2 -Z segment_words.rl -o segment_words_prod.go
//go:generate sed -i "" -e "s/BUILDTAGS/prod/" segment_words_prod.go
//go:generate sed -i "" -e "s/RAGELFLAGS/-G2/" segment_words_prod.go
//go:generate go run maketesttables.go -output tables_test.go

// NewWordSegmenter returns a new Segmenter to read from r.
func NewWordSegmenter(r io.Reader) *Segmenter {
	return NewSegmenter(r)
}

// NewWordSegmenterDirect returns a new Segmenter to work directly with buf.
func NewWordSegmenterDirect(buf []byte) *Segmenter {
	return NewSegmenterDirect(buf)
}

func SplitWords(data []byte, atEOF bool) (int, []byte, error) {
	advance, token, _, err := SegmentWords(data, atEOF)
	return advance, token, err
}

func SegmentWords(data []byte, atEOF bool) (int, []byte, int, error) {
	vals := make([][]byte, 0, 1)
	types := make([]int, 0, 1)
	tokens, types, advance, err := segmentWords(data, 1, atEOF, vals, types)
	if len(tokens) > 0 {
		return advance, tokens[0], types[0], err
	}
	return advance, nil, 0, err
}

func SegmentWordsDirect(data []byte, val [][]byte, types []int) ([][]byte, []int, int, error) {
	return segmentWords(data, -1, true, val, types)
}

// *** Core Segmenter

const maxConsecutiveEmptyReads = 100

// NewSegmenter returns a new Segmenter to read from r.
// Defaults to segment using SegmentWords
func NewSegmenter(r io.Reader) *Segmenter {
	return &Segmenter{
		r:            r,
		segment:      SegmentWords,
		maxTokenSize: MaxScanTokenSize,
		buf:          make([]byte, 4096), // Plausible starting size; needn't be large.
	}
}

// NewSegmenterDirect returns a new Segmenter to work directly with buf.
// Defaults to segment using SegmentWords
func NewSegmenterDirect(buf []byte) *Segmenter {
	return &Segmenter{
		segment:      SegmentWords,
		maxTokenSize: MaxScanTokenSize,
		buf:          buf,
		start:        0,
		end:          len(buf),
		err:          io.EOF,
	}
}

// Segmenter provides a convenient interface for reading data such as
// a file of newline-delimited lines of text. Successive calls to
// the Segment method will step through the 'tokens' of a file, skipping
// the bytes between the tokens. The specification of a token is
// defined by a split function of type SplitFunc; the default split
// function breaks the input into lines with line termination stripped. Split
// functions are defined in this package for scanning a file into
// lines, bytes, UTF-8-encoded runes, and space-delimited words. The
// client may instead provide a custom split function.
//
// Segmenting stops unrecoverably at EOF, the first I/O error, or a token too
// large to fit in the buffer. When a scan stops, the reader may have
// advanced arbitrarily far past the last token. Programs that need more
// control over error handling or large tokens, or must run sequential scans
// on a reader, should use bufio.Reader instead.
//
type Segmenter struct {
	r            io.Reader   // The reader provided by the client.
	segment      SegmentFunc // The function to split the tokens.
	maxTokenSize int         // Maximum size of a token; modified by tests.
	token        []byte      // Last token returned by split.
	buf          []byte      // Buffer used as argument to split.
	start        int         // First non-processed byte in buf.
	end          int         // End of data in buf.
	typ          int         // The token type
	err          error       // Sticky error.
}

// SegmentFunc is the signature of the segmenting function used to tokenize the
// input. The arguments are an initial substring of the remaining unprocessed
// data and a flag, atEOF, that reports whether the Reader has no more data
// to give. The return values are the number of bytes to advance the input
// and the next token to return to the user, plus an error, if any. If the
// data does not yet hold a complete token, for instance if it has no newline
// while scanning lines, SegmentFunc can return (0, nil, nil) to signal the
// Segmenter to read more data into the slice and try again with a longer slice
// starting at the same point in the input.
//
// If the returned error is non-nil, segmenting stops and the error
// is returned to the client.
//
// The function is never called with an empty data slice unless atEOF
// is true. If atEOF is true, however, data may be non-empty and,
// as always, holds unprocessed text.
type SegmentFunc func(data []byte, atEOF bool) (advance int, token []byte, segmentType int, err error)

// Errors returned by Segmenter.
var (
	ErrTooLong         = errors.New("bufio.Segmenter: token too long")
	ErrNegativeAdvance = errors.New("bufio.Segmenter: SplitFunc returns negative advance count")
	ErrAdvanceTooFar   = errors.New("bufio.Segmenter: SplitFunc returns advance count beyond input")
)

const (
	// Maximum size used to buffer a token. The actual maximum token size
	// may be smaller as the buffer may need to include, for instance, a newline.
	MaxScanTokenSize = 64 * 1024
)

// Err returns the first non-EOF error that was encountered by the Segmenter.
func (s *Segmenter) Err() error {
	if s.err == io.EOF {
		return nil
	}
	return s.err
}

func (s *Segmenter) Type() int {
	return s.typ
}

// Bytes returns the most recent token generated by a call to Segment.
// The underlying array may point to data that will be overwritten
// by a subsequent call to Segment. It does no allocation.
func (s *Segmenter) Bytes() []byte {
	return s.token
}

// Text returns the most recent token generated by a call to Segment
// as a newly allocated string holding its bytes.
func (s *Segmenter) Text() string {
	return string(s.token)
}

// Segment advances the Segmenter to the next token, which will then be
// available through the Bytes or Text method. It returns false when the
// scan stops, either by reaching the end of the input or an error.
// After Segment returns false, the Err method will return any error that
// occurred during scanning, except that if it was io.EOF, Err
// will return nil.
func (s *Segmenter) Segment() bool {
	// Loop until we have a token.
	for {
		// See if we can get a token with what we already have.
		if s.end > s.start {
			advance, token, typ, err := s.segment(s.buf[s.start:s.end], s.err != nil)
			if err != nil {
				s.setErr(err)
				return false
			}
			s.typ = typ
			if !s.advance(advance) {
				return false
			}
			s.token = token
			if token != nil {
				return true
			}
		}
		// We cannot generate a token with what we are holding.
		// If we've already hit EOF or an I/O error, we are done.
		if s.err != nil {
			// Shut it down.
			s.start = 0
			s.end = 0
			return false
		}
		// Must read more data.
		// First, shift data to beginning of buffer if there's lots of empty space
		// or space is needed.
		if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) {
			copy(s.buf, s.buf[s.start:s.end])
			s.end -= s.start
			s.start = 0
		}
		// Is the buffer full? If so, resize.
		if s.end == len(s.buf) {
			if len(s.buf) >= s.maxTokenSize {
				s.setErr(ErrTooLong)
				return false
			}
			newSize := len(s.buf) * 2
			if newSize > s.maxTokenSize {
				newSize = s.maxTokenSize
			}
			newBuf := make([]byte, newSize)
			copy(newBuf, s.buf[s.start:s.end])
			s.buf = newBuf
			s.end -= s.start
			s.start = 0
			continue
		}
		// Finally we can read some input. Make sure we don't get stuck with
		// a misbehaving Reader. Officially we don't need to do this, but let's
		// be extra careful: Segmenter is for safe, simple jobs.
		for loop := 0; ; {
			n, err := s.r.Read(s.buf[s.end:len(s.buf)])
			s.end += n
			if err != nil {
				s.setErr(err)
				break
			}
			if n > 0 {
				break
			}
			loop++
			if loop > maxConsecutiveEmptyReads {
				s.setErr(io.ErrNoProgress)
				break
			}
		}
	}
}

// advance consumes n bytes of the buffer. It reports whether the advance was legal.
func (s *Segmenter) advance(n int) bool {
	if n < 0 {
		s.setErr(ErrNegativeAdvance)
		return false
	}
	if n > s.end-s.start {
		s.setErr(ErrAdvanceTooFar)
		return false
	}
	s.start += n
	return true
}

// setErr records the first error encountered.
func (s *Segmenter) setErr(err error) {
	if s.err == nil || s.err == io.EOF {
		s.err = err
	}
}

// SetSegmenter sets the segment function for the Segmenter. If called, it must be
// called before Segment.
func (s *Segmenter) SetSegmenter(segmenter SegmentFunc) {
	s.segment = segmenter
}
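A minimal sketch of the Segmenter API defined above, using only identifiers present in this diff. The token-type constants live in the generated segment_words code (not shown here), so the type is printed as a bare int:

    package main

    import (
    	"fmt"
    	"strings"

    	"github.com/blevesearch/segment"
    )

    func main() {
    	seg := segment.NewWordSegmenter(strings.NewReader("abc 123"))
    	for seg.Segment() {
    		// Type reports the token type produced by the generated
    		// word-segmentation machine.
    		fmt.Printf("%q -> type %d\n", seg.Text(), seg.Type())
    	}
    	if err := seg.Err(); err != nil {
    		fmt.Println("error:", err)
    	}
    }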
22  vendor/github.com/blevesearch/segment/segment_fuzz.go  (generated, vendored, Normal file)

@@ -0,0 +1,22 @@

// Copyright (c) 2015 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

// +build gofuzz

package segment

func Fuzz(data []byte) int {

	vals := make([][]byte, 0, 10000)
	types := make([]int, 0, 10000)
	if _, _, _, err := SegmentWordsDirect(data, vals, types); err != nil {
		return 0
	}
	return 1
}
29  vendor/github.com/blevesearch/segment/segment_fuzz_test.go  (generated, vendored, Normal file)

@@ -0,0 +1,29 @@

// Copyright (c) 2014 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

// +build gofuzz_generate

package segment

import (
	"io/ioutil"
	"os"
	"strconv"
	"testing"
)

const fuzzPrefix = "workdir/corpus"

func TestGenerateWordSegmentFuzz(t *testing.T) {

	os.MkdirAll(fuzzPrefix, 0777)
	for i, test := range unicodeWordTests {
		ioutil.WriteFile(fuzzPrefix+"/"+strconv.Itoa(i)+".txt", test.input, 0777)
	}
}
241  vendor/github.com/blevesearch/segment/segment_test.go  (generated, vendored, Normal file)

@@ -0,0 +1,241 @@

// Copyright (c) 2014 Couchbase, Inc.
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
// except in compliance with the License. You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software distributed under the
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
// either express or implied. See the License for the specific language governing permissions
// and limitations under the License.

package segment

import (
	"bufio"
	"bytes"
	"errors"
	"io"
	"strings"
	"testing"
)

// Tests borrowed from Scanner to test Segmenter

// slowReader is a reader that returns only a few bytes at a time, to test the incremental
// reads in Scanner.Scan.
type slowReader struct {
	max int
	buf io.Reader
}

func (sr *slowReader) Read(p []byte) (n int, err error) {
	if len(p) > sr.max {
		p = p[0:sr.max]
	}
	return sr.buf.Read(p)
}

// genLine writes to buf a predictable but non-trivial line of text of length
// n, including the terminal newline and an occasional carriage return.
// If addNewline is false, the \r and \n are not emitted.
func genLine(buf *bytes.Buffer, lineNum, n int, addNewline bool) {
	buf.Reset()
	doCR := lineNum%5 == 0
	if doCR {
		n--
	}
	for i := 0; i < n-1; i++ { // Stop early for \n.
		c := 'a' + byte(lineNum+i)
		if c == '\n' || c == '\r' { // Don't confuse us.
			c = 'N'
		}
		buf.WriteByte(c)
	}
	if addNewline {
		if doCR {
			buf.WriteByte('\r')
		}
		buf.WriteByte('\n')
	}
	return
}

func wrapSplitFuncAsSegmentFuncForTesting(splitFunc bufio.SplitFunc) SegmentFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, typ int, err error) {
		typ = 0
		advance, token, err = splitFunc(data, atEOF)
		return
	}
}

// Test that the line segmenter errors out on a long line.
func TestSegmentTooLong(t *testing.T) {
	const smallMaxTokenSize = 256 // Much smaller for more efficient testing.
	// Build a buffer of lots of line lengths up to but not exceeding smallMaxTokenSize.
	tmp := new(bytes.Buffer)
	buf := new(bytes.Buffer)
	lineNum := 0
	j := 0
	for i := 0; i < 2*smallMaxTokenSize; i++ {
		genLine(tmp, lineNum, j, true)
		j++
		buf.Write(tmp.Bytes())
		lineNum++
	}
	s := NewSegmenter(&slowReader{3, buf})
	// change to line segmenter for testing
	s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines))
	s.MaxTokenSize(smallMaxTokenSize)
	j = 0
	for lineNum := 0; s.Segment(); lineNum++ {
		genLine(tmp, lineNum, j, false)
		if j < smallMaxTokenSize {
			j++
		} else {
			j--
		}
		line := tmp.Bytes()
		if !bytes.Equal(s.Bytes(), line) {
			t.Errorf("%d: bad line: %d %d\n%.100q\n%.100q\n", lineNum, len(s.Bytes()), len(line), s.Bytes(), line)
		}
	}
	err := s.Err()
	if err != ErrTooLong {
		t.Fatalf("expected ErrTooLong; got %s", err)
	}
}

var testError = errors.New("testError")

// Test the correct error is returned when the split function errors out.
func TestSegmentError(t *testing.T) {
	// Create a split function that delivers a little data, then a predictable error.
	numSplits := 0
	const okCount = 7
	errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if atEOF {
			panic("didn't get enough data")
		}
		if numSplits >= okCount {
			return 0, nil, testError
		}
		numSplits++
		return 1, data[0:1], nil
	}
	// Read the data.
	const text = "abcdefghijklmnopqrstuvwxyz"
	buf := strings.NewReader(text)
	s := NewSegmenter(&slowReader{1, buf})
	// change to line segmenter for testing
	s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit))
	var i int
	for i = 0; s.Segment(); i++ {
		if len(s.Bytes()) != 1 || text[i] != s.Bytes()[0] {
			t.Errorf("#%d: expected %q got %q", i, text[i], s.Bytes()[0])
		}
	}
	// Check correct termination location and error.
	if i != okCount {
		t.Errorf("unexpected termination; expected %d tokens got %d", okCount, i)
	}
	err := s.Err()
	if err != testError {
		t.Fatalf("expected %q got %v", testError, err)
	}
}

// Test that Scan finishes if we have endless empty reads.
type endlessZeros struct{}

func (endlessZeros) Read(p []byte) (int, error) {
	return 0, nil
}

func TestBadReader(t *testing.T) {
	scanner := NewSegmenter(endlessZeros{})
	for scanner.Segment() {
		t.Fatal("read should fail")
	}
	err := scanner.Err()
	if err != io.ErrNoProgress {
		t.Errorf("unexpected error: %v", err)
	}
}

func TestSegmentAdvanceNegativeError(t *testing.T) {
	errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if atEOF {
			panic("didn't get enough data")
		}
		return -1, data[0:1], nil
	}
	// Read the data.
	const text = "abcdefghijklmnopqrstuvwxyz"
	buf := strings.NewReader(text)
	s := NewSegmenter(&slowReader{1, buf})
	// change to line segmenter for testing
	s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit))
	s.Segment()
	err := s.Err()
	if err != ErrNegativeAdvance {
		t.Fatalf("expected %q got %v", ErrNegativeAdvance, err)
	}
}

func TestSegmentAdvanceTooFarError(t *testing.T) {
	errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
|
||||
if atEOF {
|
||||
panic("didn't get enough data")
|
||||
}
|
||||
return len(data) + 10, data[0:1], nil
|
||||
}
|
||||
// Read the data.
|
||||
const text = "abcdefghijklmnopqrstuvwxyz"
|
||||
buf := strings.NewReader(text)
|
||||
s := NewSegmenter(&slowReader{1, buf})
|
||||
// change to line segmenter for testing
|
||||
s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit))
|
||||
s.Segment()
|
||||
err := s.Err()
|
||||
if err != ErrAdvanceTooFar {
|
||||
t.Fatalf("expected %q got %v", testError, err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSegmentLongTokens(t *testing.T) {
|
||||
// Read the data.
|
||||
text := bytes.Repeat([]byte("abcdefghijklmnop"), 257)
|
||||
buf := strings.NewReader(string(text))
|
||||
s := NewSegmenter(&slowReader{1, buf})
|
||||
// change to line segmenter for testing
|
||||
s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines))
|
||||
for s.Segment() {
|
||||
line := s.Bytes()
|
||||
if !bytes.Equal(text, line) {
|
||||
t.Errorf("expected %s, got %s", text, line)
|
||||
}
|
||||
}
|
||||
err := s.Err()
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error; got %s", err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSegmentLongTokensDontDouble(t *testing.T) {
|
||||
// Read the data.
|
||||
text := bytes.Repeat([]byte("abcdefghijklmnop"), 257)
|
||||
buf := strings.NewReader(string(text))
|
||||
s := NewSegmenter(&slowReader{1, buf})
|
||||
// change to line segmenter for testing
|
||||
s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines))
|
||||
s.MaxTokenSize(6144)
|
||||
for s.Segment() {
|
||||
line := s.Bytes()
|
||||
if !bytes.Equal(text, line) {
|
||||
t.Errorf("expected %s, got %s", text, line)
|
||||
}
|
||||
}
|
||||
err := s.Err()
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error; got %s", err)
|
||||
}
|
||||
}
|
19542
vendor/github.com/blevesearch/segment/segment_words.go
generated
vendored
Normal file
19542
vendor/github.com/blevesearch/segment/segment_words.go
generated
vendored
Normal file
File diff suppressed because it is too large
Load diff
285
vendor/github.com/blevesearch/segment/segment_words.rl
generated
vendored
Normal file
285
vendor/github.com/blevesearch/segment/segment_words.rl
generated
vendored
Normal file
|
@ -0,0 +1,285 @@
|
|||
// Copyright (c) 2015 Couchbase, Inc.
|
||||
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
|
||||
// except in compliance with the License. You may obtain a copy of the License at
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
// Unless required by applicable law or agreed to in writing, software distributed under the
|
||||
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
|
||||
// either express or implied. See the License for the specific language governing permissions
|
||||
// and limitations under the License.
|
||||
|
||||
// +build BUILDTAGS
|
||||
|
||||
package segment
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"unicode/utf8"
|
||||
)
|
||||
|
||||
var RagelFlags = "RAGELFLAGS"
|
||||
|
||||
var ParseError = fmt.Errorf("unicode word segmentation parse error")
|
||||
|
||||
// Word Types
|
||||
const (
|
||||
None = iota
|
||||
Number
|
||||
Letter
|
||||
Kana
|
||||
Ideo
|
||||
)
|
||||
|
||||
%%{
|
||||
machine s;
|
||||
write data;
|
||||
}%%
|
||||
|
||||
func segmentWords(data []byte, maxTokens int, atEOF bool, val [][]byte, types []int) ([][]byte, []int, int, error) {
|
||||
cs, p, pe := 0, 0, len(data)
|
||||
cap := maxTokens
|
||||
if cap < 0 {
|
||||
cap = 1000
|
||||
}
|
||||
if val == nil {
|
||||
val = make([][]byte, 0, cap)
|
||||
}
|
||||
if types == nil {
|
||||
types = make([]int, 0, cap)
|
||||
}
|
||||
|
||||
// added for scanner
|
||||
ts := 0
|
||||
te := 0
|
||||
act := 0
|
||||
eof := pe
|
||||
_ = ts // compiler not happy
|
||||
_ = te
|
||||
_ = act
|
||||
|
||||
// our state
|
||||
startPos := 0
|
||||
endPos := 0
|
||||
totalConsumed := 0
|
||||
%%{
|
||||
|
||||
include SCRIPTS "ragel/uscript.rl";
|
||||
include WB "ragel/uwb.rl";
|
||||
|
||||
action startToken {
|
||||
startPos = p
|
||||
}
|
||||
|
||||
action endToken {
|
||||
endPos = p
|
||||
}
|
||||
|
||||
action finishNumericToken {
|
||||
if !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Number)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishHangulToken {
|
||||
if endPos+1 == pe && !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
} else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Letter)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishKatakanaToken {
|
||||
if endPos+1 == pe && !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
} else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Ideo)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishWordToken {
|
||||
if !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Letter)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishHanToken {
|
||||
if endPos+1 == pe && !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
} else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Ideo)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishHiraganaToken {
|
||||
if endPos+1 == pe && !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
} else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, Ideo)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
action finishNoneToken {
|
||||
lastPos := startPos
|
||||
for lastPos <= endPos {
|
||||
_, size := utf8.DecodeRune(data[lastPos:])
|
||||
lastPos += size
|
||||
}
|
||||
endPos = lastPos -1
|
||||
p = endPos
|
||||
|
||||
if endPos+1 == pe && !atEOF {
|
||||
return val, types, totalConsumed, nil
|
||||
} else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
// otherwise, consume this as well
|
||||
val = append(val, data[startPos:endPos+1])
|
||||
types = append(types, None)
|
||||
totalConsumed = endPos+1
|
||||
if maxTokens > 0 && len(val) >= maxTokens {
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
||||
}
|
||||
|
||||
HangulEx = Hangul ( Extend | Format )*;
|
||||
HebrewOrALetterEx = ( Hebrew_Letter | ALetter ) ( Extend | Format )*;
|
||||
NumericEx = Numeric ( Extend | Format )*;
|
||||
KatakanaEx = Katakana ( Extend | Format )*;
|
||||
MidLetterEx = ( MidLetter | MidNumLet | Single_Quote ) ( Extend | Format )*;
|
||||
MidNumericEx = ( MidNum | MidNumLet | Single_Quote ) ( Extend | Format )*;
|
||||
ExtendNumLetEx = ExtendNumLet ( Extend | Format )*;
|
||||
HanEx = Han ( Extend | Format )*;
|
||||
HiraganaEx = Hiragana ( Extend | Format )*;
|
||||
SingleQuoteEx = Single_Quote ( Extend | Format )*;
|
||||
DoubleQuoteEx = Double_Quote ( Extend | Format )*;
|
||||
HebrewLetterEx = Hebrew_Letter ( Extend | Format )*;
|
||||
RegionalIndicatorEx = Regional_Indicator ( Extend | Format )*;
|
||||
NLCRLF = Newline | CR | LF;
|
||||
OtherEx = ^(NLCRLF) ( Extend | Format )* ;
|
||||
|
||||
# UAX#29 WB8. Numeric × Numeric
|
||||
# WB11. Numeric (MidNum | MidNumLet | Single_Quote) × Numeric
|
||||
# WB12. Numeric × (MidNum | MidNumLet | Single_Quote) Numeric
|
||||
# WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
|
||||
# WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana)
|
||||
#
|
||||
WordNumeric = ( ( ExtendNumLetEx )* NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )* ( ExtendNumLetEx )* ) >startToken @endToken;
|
||||
|
||||
# subset of the below for typing purposes only!
|
||||
WordHangul = ( HangulEx )+ >startToken @endToken;
|
||||
WordKatakana = ( KatakanaEx )+ >startToken @endToken;
|
||||
|
||||
# UAX#29 WB5. (ALetter | Hebrew_Letter) × (ALetter | Hebrew_Letter)
|
||||
# WB6. (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter)
|
||||
# WB7. (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter)
|
||||
# WB7a. Hebrew_Letter × Single_Quote
|
||||
# WB7b. Hebrew_Letter × Double_Quote Hebrew_Letter
|
||||
# WB7c. Hebrew_Letter Double_Quote × Hebrew_Letter
|
||||
# WB9. (ALetter | Hebrew_Letter) × Numeric
|
||||
# WB10. Numeric × (ALetter | Hebrew_Letter)
|
||||
# WB13. Katakana × Katakana
|
||||
# WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
|
||||
# WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana)
|
||||
#
|
||||
# Marty -deviated here to allow for (ExtendNumLetEx x ExtendNumLetEx) part of 13a
|
||||
#
|
||||
Word = ( ( ExtendNumLetEx )* ( KatakanaEx ( ( ExtendNumLetEx )* KatakanaEx )*
|
||||
| ( HebrewLetterEx ( SingleQuoteEx | DoubleQuoteEx HebrewLetterEx )
|
||||
| NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )*
|
||||
| HebrewOrALetterEx ( ( ( ExtendNumLetEx )* | MidLetterEx ) HebrewOrALetterEx )*
|
||||
|ExtendNumLetEx
|
||||
)+
|
||||
)
|
||||
(
|
||||
( ExtendNumLetEx )+ ( KatakanaEx ( ( ExtendNumLetEx )* KatakanaEx )*
|
||||
| ( HebrewLetterEx ( SingleQuoteEx | DoubleQuoteEx HebrewLetterEx )
|
||||
| NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )*
|
||||
| HebrewOrALetterEx ( ( ( ExtendNumLetEx )* | MidLetterEx ) HebrewOrALetterEx )*
|
||||
)+
|
||||
)
|
||||
)* ExtendNumLetEx*) >startToken @endToken;
|
||||
|
||||
# UAX#29 WB14. Any ÷ Any
|
||||
WordHan = HanEx >startToken @endToken;
|
||||
WordHiragana = HiraganaEx >startToken @endToken;
|
||||
|
||||
WordExt = ( ( Extend | Format )* ) >startToken @endToken; # maybe plus not star
|
||||
|
||||
WordCRLF = (CR LF) >startToken @endToken;
|
||||
|
||||
WordCR = CR >startToken @endToken;
|
||||
|
||||
WordLF = LF >startToken @endToken;
|
||||
|
||||
WordNL = Newline >startToken @endToken;
|
||||
|
||||
WordRegional = (RegionalIndicatorEx+) >startToken @endToken;
|
||||
|
||||
Other = OtherEx >startToken @endToken;
|
||||
|
||||
main := |*
|
||||
WordNumeric => finishNumericToken;
|
||||
WordHangul => finishHangulToken;
|
||||
WordKatakana => finishKatakanaToken;
|
||||
Word => finishWordToken;
|
||||
WordHan => finishHanToken;
|
||||
WordHiragana => finishHiraganaToken;
|
||||
WordRegional =>finishNoneToken;
|
||||
WordCRLF => finishNoneToken;
|
||||
WordCR => finishNoneToken;
|
||||
WordLF => finishNoneToken;
|
||||
WordNL => finishNoneToken;
|
||||
WordExt => finishNoneToken;
|
||||
Other => finishNoneToken;
|
||||
*|;
|
||||
|
||||
write init;
|
||||
write exec;
|
||||
}%%
|
||||
|
||||
if cs < s_first_final {
|
||||
return val, types, totalConsumed, ParseError
|
||||
}
|
||||
|
||||
return val, types, totalConsumed, nil
|
||||
}
|
173643
vendor/github.com/blevesearch/segment/segment_words_prod.go
generated
vendored
Normal file
173643
vendor/github.com/blevesearch/segment/segment_words_prod.go
generated
vendored
Normal file
File diff suppressed because it is too large
Load diff
445
vendor/github.com/blevesearch/segment/segment_words_test.go
generated
vendored
Normal file
445
vendor/github.com/blevesearch/segment/segment_words_test.go
generated
vendored
Normal file
|
@ -0,0 +1,445 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
|
||||
// except in compliance with the License. You may obtain a copy of the License at
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
// Unless required by applicable law or agreed to in writing, software distributed under the
|
||||
// License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
|
||||
// either express or implied. See the License for the specific language governing permissions
|
||||
// and limitations under the License.
|
||||
|
||||
package segment
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"bytes"
|
||||
"reflect"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestAdhocSegmentsWithType(t *testing.T) {
|
||||
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output [][]byte
|
||||
outputStrings []string
|
||||
outputTypes []int
|
||||
}{
|
||||
{
|
||||
input: []byte("Now is the.\n End."),
|
||||
output: [][]byte{
|
||||
[]byte("Now"),
|
||||
[]byte(" "),
|
||||
[]byte(" "),
|
||||
[]byte("is"),
|
||||
[]byte(" "),
|
||||
[]byte("the"),
|
||||
[]byte("."),
|
||||
[]byte("\n"),
|
||||
[]byte(" "),
|
||||
[]byte("End"),
|
||||
[]byte("."),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"Now",
|
||||
" ",
|
||||
" ",
|
||||
"is",
|
||||
" ",
|
||||
"the",
|
||||
".",
|
||||
"\n",
|
||||
" ",
|
||||
"End",
|
||||
".",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Letter,
|
||||
None,
|
||||
None,
|
||||
Letter,
|
||||
None,
|
||||
Letter,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
Letter,
|
||||
None,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("3.5"),
|
||||
output: [][]byte{
|
||||
[]byte("3.5"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"3.5",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Number,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("cat3.5"),
|
||||
output: [][]byte{
|
||||
[]byte("cat3.5"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"cat3.5",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Letter,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("c"),
|
||||
output: [][]byte{
|
||||
[]byte("c"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"c",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Letter,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("こんにちは世界"),
|
||||
output: [][]byte{
|
||||
[]byte("こ"),
|
||||
[]byte("ん"),
|
||||
[]byte("に"),
|
||||
[]byte("ち"),
|
||||
[]byte("は"),
|
||||
[]byte("世"),
|
||||
[]byte("界"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"こ",
|
||||
"ん",
|
||||
"に",
|
||||
"ち",
|
||||
"は",
|
||||
"世",
|
||||
"界",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("你好世界"),
|
||||
output: [][]byte{
|
||||
[]byte("你"),
|
||||
[]byte("好"),
|
||||
[]byte("世"),
|
||||
[]byte("界"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"你",
|
||||
"好",
|
||||
"世",
|
||||
"界",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
Ideo,
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("サッカ"),
|
||||
output: [][]byte{
|
||||
[]byte("サッカ"),
|
||||
},
|
||||
outputStrings: []string{
|
||||
"サッカ",
|
||||
},
|
||||
outputTypes: []int{
|
||||
Ideo,
|
||||
},
|
||||
},
|
||||
// test for wb7b/wb7c
|
||||
{
|
||||
input: []byte(`א"א`),
|
||||
output: [][]byte{
|
||||
[]byte(`א"א`),
|
||||
},
|
||||
outputStrings: []string{
|
||||
`א"א`,
|
||||
},
|
||||
outputTypes: []int{
|
||||
Letter,
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
rv := make([][]byte, 0)
|
||||
rvstrings := make([]string, 0)
|
||||
rvtypes := make([]int, 0)
|
||||
segmenter := NewWordSegmenter(bytes.NewReader(test.input))
|
||||
// Set the split function for the scanning operation.
|
||||
for segmenter.Segment() {
|
||||
rv = append(rv, segmenter.Bytes())
|
||||
rvstrings = append(rvstrings, segmenter.Text())
|
||||
rvtypes = append(rvtypes, segmenter.Type())
|
||||
}
|
||||
if err := segmenter.Err(); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if !reflect.DeepEqual(rv, test.output) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.output, rv, test.input)
|
||||
}
|
||||
if !reflect.DeepEqual(rvstrings, test.outputStrings) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputStrings, rvstrings, test.input)
|
||||
}
|
||||
if !reflect.DeepEqual(rvtypes, test.outputTypes) {
|
||||
t.Fatalf("expeced:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputTypes, rvtypes, test.input)
|
||||
}
|
||||
}
|
||||
|
||||
// run same tests again with direct
|
||||
for _, test := range tests {
|
||||
rv := make([][]byte, 0)
|
||||
rvstrings := make([]string, 0)
|
||||
rvtypes := make([]int, 0)
|
||||
segmenter := NewWordSegmenterDirect(test.input)
|
||||
// Set the split function for the scanning operation.
|
||||
for segmenter.Segment() {
|
||||
rv = append(rv, segmenter.Bytes())
|
||||
rvstrings = append(rvstrings, segmenter.Text())
|
||||
rvtypes = append(rvtypes, segmenter.Type())
|
||||
}
|
||||
if err := segmenter.Err(); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if !reflect.DeepEqual(rv, test.output) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.output, rv, test.input)
|
||||
}
|
||||
if !reflect.DeepEqual(rvstrings, test.outputStrings) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputStrings, rvstrings, test.input)
|
||||
}
|
||||
if !reflect.DeepEqual(rvtypes, test.outputTypes) {
|
||||
t.Fatalf("expeced:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputTypes, rvtypes, test.input)
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestUnicodeSegments(t *testing.T) {
|
||||
|
||||
for _, test := range unicodeWordTests {
|
||||
rv := make([][]byte, 0)
|
||||
scanner := bufio.NewScanner(bytes.NewReader(test.input))
|
||||
// Set the split function for the scanning operation.
|
||||
scanner.Split(SplitWords)
|
||||
for scanner.Scan() {
|
||||
rv = append(rv, scanner.Bytes())
|
||||
}
|
||||
if err := scanner.Err(); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if !reflect.DeepEqual(rv, test.output) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s' comment: %s", test.output, rv, test.input, test.comment)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestUnicodeSegmentsSlowReader(t *testing.T) {
|
||||
|
||||
for i, test := range unicodeWordTests {
|
||||
rv := make([][]byte, 0)
|
||||
segmenter := NewWordSegmenter(&slowReader{1, bytes.NewReader(test.input)})
|
||||
for segmenter.Segment() {
|
||||
rv = append(rv, segmenter.Bytes())
|
||||
}
|
||||
if err := segmenter.Err(); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if !reflect.DeepEqual(rv, test.output) {
|
||||
t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: %d '%s' comment: %s", test.output, rv, i, test.input, test.comment)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestWordSegmentLongInputSlowReader(t *testing.T) {
|
||||
// Read the data.
|
||||
text := bytes.Repeat([]byte("abcdefghijklmnop"), 26)
|
||||
buf := strings.NewReader(string(text) + " cat")
|
||||
s := NewSegmenter(&slowReader{1, buf})
|
||||
s.MaxTokenSize(6144)
|
||||
for s.Segment() {
|
||||
}
|
||||
err := s.Err()
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error; got '%s'", err)
|
||||
}
|
||||
finalWord := s.Text()
|
||||
if s.Text() != "cat" {
|
||||
t.Errorf("expected 'cat' got '%s'", finalWord)
|
||||
}
|
||||
}
|
||||
|
||||
func BenchmarkSplitWords(b *testing.B) {
|
||||
for i := 0; i < b.N; i++ {
|
||||
vals := make([][]byte, 0)
|
||||
scanner := bufio.NewScanner(bytes.NewReader(bleveWikiArticle))
|
||||
scanner.Split(SplitWords)
|
||||
for scanner.Scan() {
|
||||
vals = append(vals, scanner.Bytes())
|
||||
}
|
||||
if err := scanner.Err(); err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
if len(vals) != 3465 {
|
||||
b.Fatalf("expected 3465 tokens, got %d", len(vals))
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func BenchmarkWordSegmenter(b *testing.B) {
|
||||
|
||||
for i := 0; i < b.N; i++ {
|
||||
vals := make([][]byte, 0)
|
||||
types := make([]int, 0)
|
||||
segmenter := NewWordSegmenter(bytes.NewReader(bleveWikiArticle))
|
||||
for segmenter.Segment() {
|
||||
vals = append(vals, segmenter.Bytes())
|
||||
types = append(types, segmenter.Type())
|
||||
}
|
||||
if err := segmenter.Err(); err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
if vals == nil {
|
||||
b.Fatalf("expected non-nil vals")
|
||||
}
|
||||
if types == nil {
|
||||
b.Fatalf("expected non-nil types")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func BenchmarkWordSegmenterDirect(b *testing.B) {
|
||||
|
||||
for i := 0; i < b.N; i++ {
|
||||
vals := make([][]byte, 0)
|
||||
types := make([]int, 0)
|
||||
segmenter := NewWordSegmenterDirect(bleveWikiArticle)
|
||||
for segmenter.Segment() {
|
||||
vals = append(vals, segmenter.Bytes())
|
||||
types = append(types, segmenter.Type())
|
||||
}
|
||||
if err := segmenter.Err(); err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
if vals == nil {
|
||||
b.Fatalf("expected non-nil vals")
|
||||
}
|
||||
if types == nil {
|
||||
b.Fatalf("expected non-nil types")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func BenchmarkDirect(b *testing.B) {
|
||||
|
||||
for i := 0; i < b.N; i++ {
|
||||
vals := make([][]byte, 0, 10000)
|
||||
types := make([]int, 0, 10000)
|
||||
vals, types, _, err := SegmentWordsDirect(bleveWikiArticle, vals, types)
|
||||
if err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
if vals == nil {
|
||||
b.Fatalf("expected non-nil vals")
|
||||
}
|
||||
if types == nil {
|
||||
b.Fatalf("expected non-nil types")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
var bleveWikiArticle = []byte(`Boiling liquid expanding vapor explosion
|
||||
From Wikipedia, the free encyclopedia
|
||||
See also: Boiler explosion and Steam explosion
|
||||
|
||||
Flames subsequent to a flammable liquid BLEVE from a tanker. BLEVEs do not necessarily involve fire.
|
||||
|
||||
This article's tone or style may not reflect the encyclopedic tone used on Wikipedia. See Wikipedia's guide to writing better articles for suggestions. (July 2013)
|
||||
A boiling liquid expanding vapor explosion (BLEVE, /ˈblɛviː/ blev-ee) is an explosion caused by the rupture of a vessel containing a pressurized liquid above its boiling point.[1]
|
||||
Contents [hide]
|
||||
1 Mechanism
|
||||
1.1 Water example
|
||||
1.2 BLEVEs without chemical reactions
|
||||
2 Fires
|
||||
3 Incidents
|
||||
4 Safety measures
|
||||
5 See also
|
||||
6 References
|
||||
7 External links
|
||||
Mechanism[edit]
|
||||
|
||||
This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2013)
|
||||
There are three characteristics of liquids which are relevant to the discussion of a BLEVE:
|
||||
If a liquid in a sealed container is boiled, the pressure inside the container increases. As the liquid changes to a gas it expands - this expansion in a vented container would cause the gas and liquid to take up more space. In a sealed container the gas and liquid are not able to take up more space and so the pressure rises. Pressurized vessels containing liquids can reach an equilibrium where the liquid stops boiling and the pressure stops rising. This occurs when no more heat is being added to the system (either because it has reached ambient temperature or has had a heat source removed).
|
||||
The boiling temperature of a liquid is dependent on pressure - high pressures will yield high boiling temperatures, and low pressures will yield low boiling temperatures. A common simple experiment is to place a cup of water in a vacuum chamber, and then reduce the pressure in the chamber until the water boils. By reducing the pressure the water will boil even at room temperature. This works both ways - if the pressure is increased beyond normal atmospheric pressures, the boiling of hot water could be suppressed far beyond normal temperatures. The cooling system of a modern internal combustion engine is a real-world example.
|
||||
When a liquid boils it turns into a gas. The resulting gas takes up far more space than the liquid did.
|
||||
Typically, a BLEVE starts with a container of liquid which is held above its normal, atmospheric-pressure boiling temperature. Many substances normally stored as liquids, such as CO2, propane, and other similar industrial gases have boiling temperatures, at atmospheric pressure, far below room temperature. In the case of water, a BLEVE could occur if a pressurized chamber of water is heated far beyond the standard 100 °C (212 °F). That container, because the boiling water pressurizes it, is capable of holding liquid water at very high temperatures.
|
||||
If the pressurized vessel, containing liquid at high temperature (which may be room temperature, depending on the substance) ruptures, the pressure which prevents the liquid from boiling is lost. If the rupture is catastrophic, where the vessel is immediately incapable of holding any pressure at all, then there suddenly exists a large mass of liquid which is at very high temperature and very low pressure. This causes the entire volume of liquid to instantaneously boil, which in turn causes an extremely rapid expansion. Depending on temperatures, pressures and the substance involved, that expansion may be so rapid that it can be classified as an explosion, fully capable of inflicting severe damage on its surroundings.
|
||||
Water example[edit]
|
||||
Imagine, for example, a tank of pressurized liquid water held at 204.4 °C (400 °F). This tank would normally be pressurized to 1.7 MPa (250 psi) above atmospheric ("gauge") pressure. If the tank containing the water were to rupture, there would for a slight moment exist a volume of liquid water which would be
|
||||
at atmospheric pressure, and
|
||||
204.4 °C (400 °F).
|
||||
At atmospheric pressure the boiling point of water is 100 °C (212 °F) - liquid water at atmospheric pressure cannot exist at temperatures higher than 100 °C (212 °F). At that moment, the water would boil and turn to vapour explosively, and the 204.4 °C (400 °F) liquid water turned to gas would take up a lot more volume than it did as liquid, causing a vapour explosion. Such explosions can happen when the superheated water of a steam engine escapes through a crack in a boiler, causing a boiler explosion.
|
||||
BLEVEs without chemical reactions[edit]
|
||||
It is important to note that a BLEVE need not be a chemical explosion—nor does there need to be a fire—however if a flammable substance is subject to a BLEVE it may also be subject to intense heating, either from an external source of heat which may have caused the vessel to rupture in the first place or from an internal source of localized heating such as skin friction. This heating can cause a flammable substance to ignite, adding a secondary explosion caused by the primary BLEVE. While blast effects of any BLEVE can be devastating, a flammable substance such as propane can add significantly to the danger.
|
||||
Bleve explosion.svg
|
||||
While the term BLEVE is most often used to describe the results of a container of flammable liquid rupturing due to fire, a BLEVE can occur even with a non-flammable substance such as water,[2] liquid nitrogen,[3] liquid helium or other refrigerants or cryogens, and therefore is not usually considered a type of chemical explosion.
|
||||
Fires[edit]
|
||||
BLEVEs can be caused by an external fire near the storage vessel causing heating of the contents and pressure build-up. While tanks are often designed to withstand great pressure, constant heating can cause the metal to weaken and eventually fail. If the tank is being heated in an area where there is no liquid, it may rupture faster without the liquid to absorb the heat. Gas containers are usually equipped with relief valves that vent off excess pressure, but the tank can still fail if the pressure is not released quickly enough.[1] Relief valves are sized to release pressure fast enough to prevent the pressure from increasing beyond the strength of the vessel, but not so fast as to be the cause of an explosion. An appropriately sized relief valve will allow the liquid inside to boil slowly, maintaining a constant pressure in the vessel until all the liquid has boiled and the vessel empties.
|
||||
If the substance involved is flammable, it is likely that the resulting cloud of the substance will ignite after the BLEVE has occurred, forming a fireball and possibly a fuel-air explosion, also termed a vapor cloud explosion (VCE). If the materials are toxic, a large area will be contaminated.[4]
|
||||
Incidents[edit]
|
||||
The term "BLEVE" was coined by three researchers at Factory Mutual, in the analysis of an accident there in 1957 involving a chemical reactor vessel.[5]
|
||||
In August 1959 the Kansas City Fire Department suffered its largest ever loss of life in the line of duty, when a 25,000 gallon (95,000 litre) gas tank exploded during a fire on Southwest Boulevard killing five firefighters. This was the first time BLEVE was used to describe a burning fuel tank.[citation needed]
|
||||
Later incidents included the Cheapside Street Whisky Bond Fire in Glasgow, Scotland in 1960; Feyzin, France in 1966; Crescent City, Illinois in 1970; Kingman, Arizona in 1973; a liquid nitrogen tank rupture[6] at Air Products and Chemicals and Mobay Chemical Company at New Martinsville, West Virginia on January 31, 1978 [1];Texas City, Texas in 1978; Murdock, Illinois in 1983; San Juan Ixhuatepec, Mexico City in 1984; and Toronto, Ontario in 2008.
|
||||
Safety measures[edit]
|
||||
[icon] This section requires expansion. (July 2013)
|
||||
Some fire mitigation measures are listed under liquefied petroleum gas.
|
||||
See also[edit]
|
||||
Boiler explosion
|
||||
Expansion ratio
|
||||
Explosive boiling or phase explosion
|
||||
Rapid phase transition
|
||||
Viareggio train derailment
|
||||
2008 Toronto explosions
|
||||
Gas carriers
|
||||
Los Alfaques Disaster
|
||||
Lac-Mégantic derailment
|
||||
References[edit]
|
||||
^ Jump up to: a b Kletz, Trevor (March 1990). Critical Aspects of Safety and Loss Prevention. London: Butterworth–Heinemann. pp. 43–45. ISBN 0-408-04429-2.
|
||||
Jump up ^ "Temperature Pressure Relief Valves on Water Heaters: test, inspect, replace, repair guide". Inspect-ny.com. Retrieved 2011-07-12.
|
||||
Jump up ^ Liquid nitrogen BLEVE demo
|
||||
Jump up ^ "Chemical Process Safety" (PDF). Retrieved 2011-07-12.
|
||||
Jump up ^ David F. Peterson, BLEVE: Facts, Risk Factors, and Fallacies, Fire Engineering magazine (2002).
|
||||
Jump up ^ "STATE EX REL. VAPOR CORP. v. NARICK". Supreme Court of Appeals of West Virginia. 1984-07-12. Retrieved 2014-03-16.
|
||||
External links[edit]
|
||||
Look up boiling liquid expanding vapor explosion in Wiktionary, the free dictionary.
|
||||
Wikimedia Commons has media related to BLEVE.
|
||||
BLEVE Demo on YouTube — video of a controlled BLEVE demo
|
||||
huge explosions on YouTube — video of propane and isobutane BLEVEs from a train derailment at Murdock, Illinois (3 September 1983)
|
||||
Propane BLEVE on YouTube — video of BLEVE from the Toronto propane depot fire
|
||||
Moscow Ring Road Accident on YouTube - Dozens of LPG tank BLEVEs after a road accident in Moscow
|
||||
Kingman, AZ BLEVE — An account of the 5 July 1973 explosion in Kingman, with photographs
|
||||
Propane Tank Explosions — Description of circumstances required to cause a propane tank BLEVE.
|
||||
Analysis of BLEVE Events at DOE Sites - Details physics and mathematics of BLEVEs.
|
||||
HID - SAFETY REPORT ASSESSMENT GUIDE: Whisky Maturation Warehouses - The liquor is aged in wooden barrels that can suffer BLEVE.
|
||||
Categories: ExplosivesFirefightingFireTypes of fireGas technologiesIndustrial fires and explosions`)
|
11994
vendor/github.com/blevesearch/segment/tables_test.go
generated
vendored
Normal file
11994
vendor/github.com/blevesearch/segment/tables_test.go
generated
vendored
Normal file
File diff suppressed because it is too large
Load diff
Loading…
Add table
Add a link
Reference in a new issue