20220418012 / homework2_dialog_project
Commit 680929d8, authored 2 years ago by 20220418012
parent 8449dcc4
Showing 1 changed file with 51 additions and 0 deletions
NLU/data/extract_data.py (0 → 100644) @ 680929d8
import os
import json
import random

in_file = '/home/data/tmp/NLP_Course/Joint_NLU/data/train.json'
out_train_file = '/home/data/tmp/NLP_Course/Joint_NLU/data/train.tsv'
out_test_file = '/home/data/tmp/NLP_Course/Joint_NLU/data/test.tsv'
cls_vocab_file = '/home/data/tmp/NLP_Course/Joint_NLU/data/cls_vocab'
slot_vocab_file = '/home/data/tmp/NLP_Course/Joint_NLU/data/slot_vocab'

with open(in_file) as f:
    data = json.load(f)
print('{} lines read'.format(len(data)))

cls_label = set()   # distinct "domain@intent" classification labels
slot_label = set()  # distinct per-character slot tags
total_data = []
for d in data:
    domain = d['domain']
    intent = d['intent']
    cls_label.add('{}@{}'.format(domain, intent))
    # One tag per character of the text; 'o' marks characters outside any slot.
    slots = ['o'] * len(d['text'])
    for s in d['slots']:
        # First occurrence of the slot value; raises ValueError if it is absent.
        idx = d['text'].index(d['slots'][s])
        slots[idx] = 'B-' + s
        for i in range(idx + 1, idx + len(d['slots'][s])):
            slots[i] = 'I-' + s
    slot_label.update(slots)
    # Note: .lower() lowercases the whole line, including the class label and
    # the B-/I- tags, while the vocab files below keep their original case.
    total_data.append(('{}@{}\t{}\t{}'.format(domain, intent, d['text'], ' '.join(slots))).lower())

random.shuffle(total_data)

# The first 2000 shuffled examples form the training set; the rest are the test set.
with open(out_train_file, 'w') as f:
    for i in total_data[:2000]:
        print(i, file=f)
with open(out_test_file, 'w') as f:
    for i in total_data[2000:]:
        print(i, file=f)
with open(cls_vocab_file, 'w') as f:
    for i in cls_label:
        print(i, file=f)
with open(slot_vocab_file, 'w') as f:
    for i in slot_label:
        print(i, file=f)
print('fin.')
\ No newline at end of file
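The slot-labelling loop in extract_data.py can be isolated into a small function, which makes its edge cases easier to test. The following is only a sketch, not the committed code: the name `bio_tags`, the `outside` parameter, and the decision to skip a slot value that is empty or missing from the text (where the committed `str.index` call would raise `ValueError`) are all assumptions.

```python
def bio_tags(text, slots, outside='o'):
    """Return one BIO tag per character of `text`.

    `slots` maps slot name -> slot value (expected to be a substring of
    `text`), mirroring d['slots'] in extract_data.py.
    Hypothetical helper; not part of the commit.
    """
    tags = [outside] * len(text)
    for name, value in slots.items():
        if not value:
            continue  # assumption: ignore empty slot values
        idx = text.find(value)  # find() returns -1 instead of raising
        if idx < 0:
            continue  # assumption: skip values that do not occur in the text
        tags[idx] = 'B-' + name
        for i in range(idx + 1, idx + len(value)):
            tags[i] = 'I-' + name
    return tags


if __name__ == '__main__':
    # Tags the four characters of "jazz" as B-genre, I-genre, I-genre, I-genre.
    print(bio_tags('play jazz', {'genre': 'jazz'}))
```

As in the committed loop, only the first occurrence of each slot value is tagged, and a later slot overwrites an earlier one where their spans overlap.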
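extract_data.py shuffles with the global `random` state and no seed, so each run writes a different train/test split. If a repeatable split is wanted, one option is a seeded, private generator. This is a sketch under that assumption: `split_examples`, `n_train`, and `seed` are illustrative names, and only the default of 2000 comes from the commit.

```python
import random


def split_examples(examples, n_train=2000, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test).

    Hypothetical helper; not part of the commit.
    """
    rng = random.Random(seed)  # private RNG; the global random state is untouched
    shuffled = list(examples)  # copy so the caller's sequence is not reordered
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == '__main__':
    train, test = split_examples(['a', 'b', 'c', 'd', 'e'], n_train=3, seed=0)
    print(len(train), len(test))
```

Calling the function twice with the same `seed` returns the same split, which keeps evaluation numbers comparable between runs.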