Segmenting English text into multi-word phrases

OP: BarryLu

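There is already an earlier question on this: https://www.v2ex.com/t/340752#reply13

I searched on this for quite a while and still don't really understand it, so I opened this thread.

I want to segment English text into phrases (that may not be the right term). For example, given I love New York, I want the result to be I / love / New York rather than I / love / New / York. Splitting New York changes the original meaning.

There are plenty of tools for Chinese word segmentation, such as jieba ( https://github.com/fxsjy/jieba ), but finding one that segments English into phrases is very hard. I don't even know which term to search for: Tokenizer, Chunking, or text segmentation. Is there a convenient, ready-to-use tool that does this for English?

For example, Stanford's stanza ( https://github.com/stanfordnlp/stanza ) can be used for segmentation. Its Chinese results are fine, but for English it only splits on whitespace.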

    import stanza

    text = """英国首相约翰逊 6 日晚因病情恶化。"""

    zh_nlp = stanza.Pipeline('zh')
    doc = zh_nlp(text)

    for sent in doc.sentences:
        print("Sentence：" + sent.text)  # sentence splitting
        print("Tokenize：" + ' '.join(token.text for token in sent.tokens))  # Chinese word segmentation

Its output is properly segmented, which is fine:

Tokenize：英国 首相 约翰逊 6 日 晚因 病情 恶化 ， 被 转入 重症 监护 室 治疗 。

But with English tokenization:

    import stanza

    # tokenize_no_ssplit=True: no sentence splitting; blank lines (\n\n) mark sentence boundaries
    nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=True)
    doc = nlp('This is a sentence.\n\nThis is a second. This is a third.')
    for i, sentence in enumerate(doc.sentences):
        print(f'====== Sentence {i+1} tokens =======')
        print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

The output is:

    ====== Sentence 1 tokens =======
    id: (1,)	text: This
    id: (2,)	text: is
    id: (3,)	text: a
    id: (4,)	text: sentence
    id: (5,)	text: .
    ====== Sentence 2 tokens =======
    id: (1,)	text: This
    id: (2,)	text: is
    id: (3,)	text: a
    id: (4,)	text: second
    id: (5,)	text: .
    id: (6,)	text: This
    id: (7,)	text: is
    id: (8,)	text: a
    id: (9,)	text: third
    id: (10,)	text: .

Replies (9)

User: heiheidewo

Write one yourself. Segmentation is usually done with bidirectional maximum matching, so you just need to treat New York as a single word.
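A minimal sketch of the bidirectional maximum matching suggested above, applied to whitespace-separated English words. The `PHRASES` lexicon and the function names are hypothetical, and the tie-breaking rule (prefer the backward result unless the forward one has fewer tokens) is one common convention rather than the only option. Punctuation handling is left out.

```python
from typing import List, Set, Tuple

# Hypothetical phrase lexicon; a real one would come from a gazetteer or domain word list.
PHRASES: Set[Tuple[str, ...]] = {
    ("New", "York"),
    ("New", "York", "City"),
    ("ice", "cream"),
}
MAX_LEN = max(len(p) for p in PHRASES)


def forward_match(words: List[str]) -> List[str]:
    """Scan left to right, taking the longest lexicon phrase starting at each position."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            # A single word is always accepted as a fallback.
            if n == 1 or tuple(words[i:i + n]) in PHRASES:
                out.append(" ".join(words[i:i + n]))
                i += n
                break
    return out


def backward_match(words: List[str]) -> List[str]:
    """Scan right to left, taking the longest lexicon phrase ending at each position."""
    out, j = [], len(words)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if n == 1 or tuple(words[j - n:j]) in PHRASES:
                out.append(" ".join(words[j - n:j]))
                j -= n
                break
    return out[::-1]


def segment(text: str) -> List[str]:
    """Run both directions and keep the result with fewer tokens (backward on ties)."""
    words = text.split()
    fwd, bwd = forward_match(words), backward_match(words)
    return fwd if len(fwd) < len(bwd) else bwd


print(segment("I love New York"))  # ['I', 'love', 'New York']
```

The quality of the result depends entirely on the lexicon: any phrase missing from `PHRASES` is still split word by word.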

User: TimePPT

I don't follow. What exactly is wrong with the output of stanza's en pipeline in your example?

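User: chizuo

You could try libraries like nltk/spacy. Tokenization generally works at the word level, so the problem you describe is hard to avoid. You could try working at the sub-word level and combining it with signals such as named entity and pos_tag.

Search for related papers using nlp tokenizer segmenter as keywords.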

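OP: BarryLu

@TimePPT What I mean is that Stanza's English tokenization just splits on whitespace, whereas for Chinese it does "real" segmentation. Likewise, the English Tokenizer in Tensorflow Keras also just splits on whitespace. I haven't found an off-the-shelf "English segmentation" tool that behaves like Chinese word segmentation.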

OP: BarryLu

@heiheidewo Write it myself

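User: TimePPT

@BarryLu It's hard to do. As the previous reply also said, NER and similar techniques get you part of the way, or you can match directly against a dictionary. But the problem can't be avoided completely.
Also, segmentation granularity is actually a harder problem in Chinese, and it has to be tuned to the needs of the application.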

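User: mxalbert1996

This situation is rare in English to begin with, and a large share of the cases are proper nouns, so it doesn't have much impact. Besides, NLP these days is all RNNs, which can pick up on the surrounding context, so it matters even less.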

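User: jhdxr

What you want isn't really tokenization. Take I want to have a cup of green apple juice.
Under your definition of "segmentation", is green apple juice one "word" or several?

If you consider it one word, you could look into syntax parsing.
If you consider it several words, then my guess is that what you actually want is proper-noun recognition (try segmenting the Chinese phrase "南京市长江大桥"), and you could look into NER/NEM.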
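To make the first option concrete, here is a rough sketch of shallow noun-phrase chunking, a much simpler stand-in for full syntax parsing. The `(word, tag)` pairs are written by hand to imitate what a POS tagger might return (the tag names follow the Universal POS convention), and `chunk_nominals` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

# Hand-written (word, tag) pairs standing in for a real POS tagger's output.
TAGGED: List[Tuple[str, str]] = [
    ("I", "PRON"), ("want", "VERB"), ("to", "PART"), ("have", "VERB"),
    ("a", "DET"), ("cup", "NOUN"), ("of", "ADP"),
    ("green", "ADJ"), ("apple", "NOUN"), ("juice", "NOUN"), (".", "PUNCT"),
]

NOMINAL_TAGS = {"ADJ", "NOUN", "PROPN"}


def chunk_nominals(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join each maximal run of adjective/noun tokens into a single chunk."""
    chunks: List[str] = []
    run: List[str] = []
    for word, tag in tagged:
        if tag in NOMINAL_TAGS:
            run.append(word)
            continue
        if run:  # a non-nominal token closes the current run
            chunks.append(" ".join(run))
            run = []
        chunks.append(word)
    if run:  # flush a run that reaches the end of the sentence
        chunks.append(" ".join(run))
    return chunks


print(chunk_nominals(TAGGED))
# ['I', 'want', 'to', 'have', 'a', 'cup', 'of', 'green apple juice', '.']
```

Whether green apple juice should come out as one unit is exactly the granularity decision raised above; this rule simply merges every adjective/noun run, which will over-merge in some sentences.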

User: Merlini

This generally needs a knowledge graph to help, or you can just use a pretrained NER model. For example, spaCy's NER:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

https://spacy.io/usage/linguistic-features#named-entities-101
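Once entity offsets like the ones printed by the NER loop above are available, they can be used to keep each entity as a single token. The following is a sketch in plain Python: `tokenize_with_spans` is a hypothetical helper, the offsets in `entity_spans` are hard-coded for illustration rather than produced by a model, and the spans are assumed not to overlap.

```python
import re
from typing import List, Tuple

WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def tokenize_with_spans(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Tokenize text, keeping every (start_char, end_char) span as one token."""
    tokens: List[str] = []
    pos = 0
    for start, end in sorted(spans):
        # Ordinary word/punctuation tokens before the protected span.
        tokens += WORD_OR_PUNCT.findall(text[pos:start])
        tokens.append(text[start:end])
        pos = end
    # Whatever remains after the last span.
    tokens += WORD_OR_PUNCT.findall(text[pos:])
    return tokens


text = "I love New York"
# Hypothetical offsets in the same (start_char, end_char) form as the NER output.
entity_spans = [(7, 15)]
print(tokenize_with_spans(text, entity_spans))  # ['I', 'love', 'New York']
```

This only covers phrases the NER model recognizes as entities, so non-entity phrases are still split into individual words.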