Extracting tables from a DOCX Word document in python
This triggers «XPathEvalError: Undefined namespace prefix» error. I’m sure it’s just the first one to expect while developing the script. Unfortunately, I couldn’t find a tutorial for python-docx. Could you kindly provide an example of table extraction?
You should look for a python xml tutorial. The opendocx function returns an xml document, according to the source. The rest of the functions in the python-docx library are wrappers around the lxml python library, at lxml.de or so it seems to me.
@Spencer Rathbun: The following code doesn’t raise error, but strangely yields [] tableList = document.xpath(‘//tbl’) print tableList Is this some DOCX feature that I don’t understand?
@mgierdal that looks like a correct result. It’s searching through the xml tree for tbl and not finding it. So, your result set is empty. I’d suggest print document to get out your entire xml and see if you actually have the tag you expect. If so, then your function call is incorrect in some way. Find the docs on the xpath function in the lxml library, and see if you’ve got a malformed command.
@Spencer Rathbun: Printing with print etree.tostring(document) shows
3 Answers 3
After some back and forth, we found out that a namespace was needed for this to work correctly. The xpath method is the appropriate solution, it just needs to have the document namespace passed in first.
The lxml xpath method has the details for namespace stuff. Look down the page in the link for passing a namespaces dictionary, and other details.
As explained by mgierdal in his comment above:
tblList = document.xpath(‘//w:tbl’, namespaces=document.nsmap) works like a dream. So, as I understand it w: is a shorthand that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.
Working with tables in Python-Docx
For complete list of table styles, view on python-docx.
Create Table
Lets Create a table with 2 rows and 2 colums using add_table method. Table style can be defined using style argument which in this case is Table Grid.
from docx import Document # create document doc = Document() # add grid table table = doc.add_table(rows=2, cols=2, style="Table Grid")
Now table rows and columns can be accessed using table.rows or table.columns attribute respectively and then we can access each cell in that row using row.cells . Lets add headings to 1st row.
# access first row's cells heading_row = table.rows[0].cells # add headings heading_row[0].text = "Name" heading_row[1].text = "Marks"
Same way we can add other info to rows
# access second row's cells data_row = table.rows[1].cells # add headings data_row[0].text = "Ali" data_row[1].text = "68"
We can also add rows after table creation using add_row method which adds a row in table and we can add data to that row.
# add new row to table data_row = table.add_row().cells # add headings data_row[0].text = "Bilal" data_row[1].text = "26"
It creates a table with 3 rows and 2 columns with information.
Cell Margin
We can use OpenXML to modify and set any style using python-docx. Here we can specify margin to any cell where we can add or completely remove a cell margin.
from docx.oxml.shared import OxmlElement from docx.oxml.ns import qn def set_cell_margins(cell, **kwargs): """ cell: actual cell instance you want to modify usage: set_cell_margins(cell, top=50, start=50, bottom=50, end=50) provided values are in twentieths of a point (1/1440 of an inch). read more here: http://officeopenxml.com/WPtableCellMargins.php """ tc = cell._tc tcPr = tc.get_or_add_tcPr() tcMar = OxmlElement('w:tcMar') for m in ["top", "start", "bottom", "end"]: if m in kwargs: node = OxmlElement("w:<>".format(m)) node.set(qn('w:w'), str(kwargs.get(m))) node.set(qn('w:type'), 'dxa') tcMar.append(node) tcPr.append(tcMar)
Now we can add margin in any cell to increase space.
# access second row's cells data_row = table.add_row().cells set_cell_margins(data_row[0], top=100, start=100, bottom=100, end=50) # add headings data_row[0].text = "Usman" data_row[1].text = "76"
Nested Table
We can also created nested tables where we can add a table to a cell. For example, if for 1 person, we want to add marks for multiple subjects. We can add a table to parent table cells.
marks = # add new row data_row = table.add_row().cells # name of person data_row[0].text = "Nasir" # We add a table to 2nd cell with rows equal to our entries (3) marks_table = data_row[1].add_table(rows=len(marks), cols=2)
Now we can iterate over values and add to table.
for i, row in enumerate(marks.items()): # iterate over 3 values marks_table.rows[i].cells[0].text = row[0] # sub table first cell marks_table.rows[i].cells[1].text = str(row[1]) # second cell
We can also show images inside tables.
Table Images
In table paragraphs, we can add images to table. Here is a simple example to add two images to table. Here we get paragraph for a cell where we want to display image and then use add_picture method to add an image from path. and we also specify height and width in Inches.
from docx.shared import Inches, Cm doc = Document() # create doc doc.add_heading('Images in Table', 0) # add heading # create table with two rows and columns table = doc.add_table(rows=2, cols=2, style="Table Grid") # add first image with text table.rows[0].cells[0].text = 'This is image text inside table for 1st image.' # add image to table paragraph = table.rows[0].cells[1].paragraphs[0] run = paragraph.add_run() run.add_picture('image-1.jpg', width=Inches(3), height=Inches(1.5)) table.rows[1].cells[0].text = 'This is image text inside table for 2nd image.' # add image to table paragraph = table.rows[1].cells[1].paragraphs[0] run = paragraph.add_run() run.add_picture('image-2.jpg', width=Inches(3), height=Inches(1.5)) # save to file doc.save("images-table.docx")
So it creates a table with text in 1st column and images in 2nd column for each row.
For more info to working with images, view my next post or view python-docx documentation.
View can add data from csv file, text file or any other file to tables in docx. So for more info view python docx documentation.
Parsing of table from .docx file [closed]
I want to parse a table from a .docx file using Python and python-docx into some useful data structure. The .docx file contains only a single table in my case. I’ve uploaded it so you can have a look. Here’s a screenshot:
Post code and relevant materials here, not on some 3rd party site, and especially not with some shortened URL to some unknown website.
I have tried a lot of ways to parse it but, didn’t get anything to work — so didn’t pasted code. I don’t think it is useful if code is not working
1 Answer 1
You can use the snippet below to parse your document into a list where each row is a dictionary mapping the table header value to the column value.
from docx.api import Document # Load the first table from your document. In your example file, # there is only one table, so I just grab the first one. document = Document('Books.docx') table = document.tables[0] # Data will be a list of rows represented as dictionaries # containing each row's data. data = [] keys = None for i, row in enumerate(table.rows): text = (cell.text for cell in row.cells) # Establish the mapping based on the first row # headers; these will become the keys of our dictionary if i == 0: keys = tuple(text) continue # Construct a dictionary for this row, mapping # keys to values for this row row_data = dict(zip(keys, text)) data.append(row_data)
If you’d just want a tuple for each row, you should instead of creating a dictionary just set row_data to the tuple value of text , so in the loop instead of constructing the dict , do:
# Construct a tuple for this row row_data = tuple(text) data.append(row_data)
Now, data would hold something like this instead:
data = [ (u'1', u'Six Thinking Hats', u'Edward de BONO', u'Penguin Books' ), . ]
Then you can skip constructing keys , obviously (but still skip the first row!).
python -docx to extract table from word docx
I know this is a repeated question but the other answers did not work for me. I have a word file that consists of one table. I want that table as an output of my python program. I’m using python 3.6 and I have installed python -docx as well. Here is my code for the data extraction
from docx.api import Document document = Document('test_word.docx') table = document.tables[0] data = [] keys = None for i, row in enumerate(table.rows): text = (cell.text for cell in row.cells) if i == 0: keys = tuple(text) continue row_data = dict(zip(keys, text)) data.append(row_data) print (data)
This question is useful in linking search engine query [extract tables from docx] to existence of great python-docx package.
1 Answer 1
Your code works fine for me. How about inserting it into a dataframe?
import pandas as pd from docx.api import Document document = Document('test_word.docx') table = document.tables[0] data = [] keys = None for i, row in enumerate(table.rows): text = (cell.text for cell in row.cells) if i == 0: keys = tuple(text) continue row_data = dict(zip(keys, text)) data.append(row_data) print (data) df = pd.DataFrame(data)
How can i display particular row and column in that table? We can extract rows and cols based on index with iloc
# iloc[row,columns] df.iloc[0,:].tolist() # [5,6,7,8] - row index 0 df.iloc[:,0].tolist() # [5,9,13,17] - column index 0 df.iloc[0,0] # 5 - cell(0,0) df.iloc[1:,2].tolist() # [11,15,19] - column index 2, but skip first row
However, if your columns have names (in this case it is numbers) you can do it like this:
#df["name"].tolist() df[1].tolist() # [5,6,7,8] - column with name 1
prints, which is how the table looks like in my sample doc.
1 2 3 4 0 5 6 7 8 1 9 10 11 12 2 13 14 15 16 3 17 18 19 20